CHAPTER 1


INTRODUCTION 


1.1.  Motivation  and  Research  Objective 

There  is  unquestionably  a  need  for  high-speed  general-purpose  computer  systems: 
the  range  of  applications  handled  by  computers  as  well  as  the  volume  of  data  processed 
by  computers  is  continuously  increasing.  Advances  in  circuit  technology  have  resulted 
in  dramatic  performance  improvements.  However,  circuit  technology  advances  alone 
have  proven  insufficient  in  satisfying  the  increasing  demand  for  higher  performance. 

By  providing  a  high  degree  of  concurrency  through  parallelism  and  pipelining, 
modern  supercomputers  are  capable  of  delivering  significantly  higher  performance  for 
applications  dominated  by  numerical  computations.  In  contrast,  the  extensive  use  of 
concurrency  to  achieve  higher  performance  for  applications  dominated  by  nonnumeric 
or symbolic computations has met with only limited success.

The objective of this research is to investigate new techniques that use concurrency
to improve the performance of nonnumeric/symbolic computation-intensive
applications. The approach taken by this research is an integrated design philosophy in
which the machine organization and instruction set architecture are developed in
conjunction with the development of the compiler's code generation strategy.

1.2.  Overview  of  this  Work 




This work is concerned with achieving highly concurrent processing of scalar code,
i.e. code without vector instructions. Concurrency is most easily obtained for code that
is readily vectorizable. Inherently scalar code, i.e. code that cannot be vectorized,
causes severe performance degradation in most concurrent machines. Such code occurs
to some extent in all applications and dominates in nonnumeric and symbolic
applications. Inherently scalar code is often characterized by

(i) prolific use of data-dependent conditional branches with very little computation
between  successive  branches,  and 

(ii)  use  of  linked  data  structures  with  memory  address  pointers,  as  opposed  to  array 
data  structures  with  integer  indices. 

We have developed compiler code generation techniques, architectural support features,
and a machine organization that specifically address the problems of highly
concurrent scalar computation.

Chapter 2 focuses on the problem of conditional branches. In this chapter we
propose a code generation heuristic, called the decision tree scheduling (DTS) technique,
that performs extensive code rearrangement over a complex of basic blocks to achieve
high levels of speedup. The DTS technique is based on a software implementation of
well-known hardware speedup techniques for instruction pipelines, including
out-of-order execution, branch prediction, and branch lookahead with conditional execution.

To  support  the  DTS  technique  we  propose  an  architectural  concept  called  guarded 
instructions.  Guarded  store  instructions  enhance  a  compiler’s  ability  to  reorder  loads 
and  stores,  thus  increasing  the  average  level  of  concurrency.  Guarded  jump  instructions 
allow the execution of conditional branches to overlap, significantly reducing the
average time per transfer of control.

The DTS technique is most useful for highly concurrent architectures: both
parallelism and pipelining can be exploited efficiently. We present a performance evaluation of
the DTS technique based on realistic but problematic workloads drawn from the UNIX
kernel and other sources. The evaluation measures concurrency relative to a
Cray-1-like scalar unit by considering a pipelined processor with similar
timing and the capability of issuing one or more instructions per clock cycle.

Chapter  3  focuses  on  the  problem  of  code  generation  for  program  loops,  with  a 
running  example  that  manipulates  linked  data  structures.  Because  the  traversal  of 
linked data structures introduces special problems, conventional loop-speedup
techniques such as vectorization and multitasking are shown to be unusable and/or less
cost-effective.  In  this  chapter  we  present  a  code  generation  algorithm,  called  the  simple 
loop  scheduling  (SLS)  algorithm,  that  generates  throughput-optimal  loop  code  for  a  class 
of  loops  that  does  not  contain  nested  conditional  statements.  The  significance  of 
throughput-optimal  loop  code  is  that,  while  the  loop  is  executing  in  steady-state,  peak 
performance  is  continuously  maintained. 

Chapters 2 and 3 deal primarily with compiler-based code optimization
techniques. These code optimization techniques were developed based on a fairly detailed
machine  model.  Chapter  4  describes  the  machine  model  and  discusses  implementation 
considerations  that  motivated  the  particular  choice  of  machine  organization.  Several 
machine  dependent  code  generation  issues,  including  the  problem  of  register  allocation, 
are  also  discussed  in  this  chapter. 

The  main  results  of  this  research  are  summarized  in  chapter  5.  In  this  chapter  we 
also  present  some  suggestions  for  future  research  in  this  area. 


CHAPTER  2 


SCHEDULING  SCALAR  CODE 


2.1.  Introduction 

High speed scalar processing is an essential characteristic of high performance
general purpose computer systems. Pipelined instruction execution is the standard method
for increasing scalar performance beyond performance levels achievable by fast logic
technology alone[1, 2]. Unfortunately, the potential throughput of instruction pipelines
is rarely achieved because scalar code usually contains many data dependencies and
conditional branches. These disrupt the smooth flow of instructions, which causes the
pipeline to be underutilized and degrades performance.

Many  techniques  have  been  proposed  for  improving  the  throughput  of  instruction 
pipelines. Well known techniques include out-of-order execution[3], branch
prediction[4], and branch lookahead (conditional issue of further instructions while awaiting
branch  outcomes)[2].  Although  effective,  implementing  these  techniques  via  hardware 
increases  the  complexity  of  pipeline  control  and  usually  causes  the  clock  cycle  to  be 
lengthened  as  well.  Thus  the  performance  advantage  gained  by  increasing  the  pipeline 
utilization  by  such  techniques  is  degraded  by  both  an  increase  in  cost  and  a  reduction  in 
the  clock  speed. 

In  this  chapter,  we  propose  a  software  implementation  of  these  techniques,  with 
modest  hardware  support,  that  minimizes  these  disadvantages.  This  approach  relies  on 
an integrated design philosophy in which the machine architecture is developed in
conjunction with the development of the compiler’s code generation strategy. This idea
represents an extension of similar philosophies reported in the literature for processors
with very short instruction pipelines (in the range of two to four
segments)[5, 6, 7, 8, 9].


The  ideas  of  this  chapter  naturally  extend  to  much  longer  instruction  pipelines 
and  multiple-pipeline  parallel  architectures.  The  potential  throughput  of  an  instruction 
pipeline  is  increased  by  partitioning  the  instruction  pipeline  into  more  segments  with 
finer  granularity,  thereby  increasing  the  length  of  the  pipeline  but  speeding  up  the 
clock.  Delivered  performance  rarely  approaches  the  potentially  higher  instruction  issue 
rate  of  longer  pipelines  due  to  the  increased  difficulty  of  efficiently  utilizing  a  longer 
pipeline. The instruction set modifications, their hardware support, and the code
optimization techniques proposed here are most useful for such highly-concurrent scalar
architectures.

2.2. Motivation for the Scalar Processing Problem

Program structure is perhaps best characterized by a program graph whose nodes
represent basic blocks and whose arcs represent control flow from block to block. A
basic block is a maximal set of instructions such that every instruction in the block is
executed exactly once each time the block is entered. Efficient concurrent execution is
difficult to achieve since basic blocks typically

(i) have few instructions, e.g. three to six[4, 10],

(ii) have internal data dependencies[11], and

(iii) have a branch instruction at the end[12].

Scalar code optimization is typically performed at the block level, but little
optimization can be performed with few instructions. Data dependencies limit concurrency in
pipelines. Branch instructions have embedded dependencies in the tests for conditional
branches and create delay or uncertainty in selecting the next block for execution.
Thus effective optimization techniques must consider a complex of multiple
blocks.  A  convenient  representation  of  such  a  complex  is  a  decision  tree.  Without  loss 
of  generality,  we  consider  binary  decision  trees  of  basic  blocks,  where  each  interior 
block  terminates  in  a  two-way  conditional  branch  and  each  exterior  block  terminates  in 
an  unconditional  branch  to  the  root  of  another  decision  tree. 

Consider the binary decision tree shown in figure 2.1, written in the language C.
This simple example is representative of many nonnumeric programs in that conditional
statements are frequently nested and assignment statements are extremely simple.
Although in this example the leaves of the decision tree terminate in goto statements,
they could also represent procedure calls.

    struct {
        int A, B, C;
    } *x, *y;

    if( x->A <= 1 ) {
        if( x->B <= 8 ) {
            y->C = 10;
            goto Z;
        } else {
            x->C = 9;
            goto Y;
        }
    } else {
        if( y->A <= 2 ) {
            x->C = 7;
            goto X;
        } else {
            x->B = 3;
            if( y->B <= 4 ) {
                y->C = 6;
                goto W;
            } else {
                x->C = 5;
                goto V;
            }
        }
    }

Figure 2.1. Example of decision tree in source language.

In order to discuss performance issues, this source level program fragment must be
compiled into machine language. We chose to use a load/store
architecture[1, 5, 6, 7, 8, 9] because, by explicitly separating memory references from
computations, the compiler has greater flexibility in rearranging instructions to improve pipeline
utilization. We also chose to avoid condition codes by using comparison instructions
that produce a boolean value in a register[7], thereby eliminating the constraint that the
comparison instruction must immediately precede the conditional branch instruction
that uses its result.

An assembly language representation of this program fragment is shown in
figure 2.2. Loads are shown as "a <- A(x)" where the address is specified by base
register x with displacement A and the content is placed in temporary register a.
Similarly, stores are shown as "C(x) <- 3". Comparisons are shown as "b <- a <= 1" where
b receives the boolean result of the a <= 1 test. Conditional branches on boolean values
are shown as "if( b ) jump" where the destination of a taken branch is shown by a line.
Labels, such as "c:" in the first branch instruction, have been given so that every
instruction can be identified either by the result register name or by the label. This
notation facilitates understanding of code rearrangements in later examples.

   a <- A(x)
   b <- a <= 1
c: if( b ) jump ----------------------------------+
                                                  |
   d <- A(y)                                      q <- B(x)
   e <- d <= 2                                    r <- q <= 8
f: if( e ) jump ------------+                  s: if( r ) jump ------------+
                            |                                              |
g: B(x) <- 3             o: C(x) <- 7          t: C(x) <- 9             v: C(y) <- 10
   h <- B(y)             p: goto X             u: goto Y                w: goto Z
   i <- h <= 4
j: if( i ) jump ------------+
                            |
k: C(x) <- 5             m: C(y) <- 6
l: goto V                n: goto W

Figure 2.2. Assembly language level representation of decision tree.

Suppose  this  code  was  optimized  to  run  on  a  machine  with  a  very  short  instruction 
pipeline  such  as  the  RISC-1  microprocessor^].  This  microprocessor  performs  a  load  or 
delayed branch instruction in two cycles and all other instructions in one cycle.
Figure 2.3 shows the execution schedule of the example program on this machine. In this
figure the time in clock cycles appears at the left. The completion time for each path
through the decision tree is given by the issue time of the pseudo-instruction
"-done-".

Time  not taken             f or j taken          c taken               s taken
  0      a <- A(x)
  1      d <- A(y)
  2      b <- a <= 1
  3   c: if( b ) jump
  4      e <- d <= 2
  5   f: if( e ) jump                                q <- B(x)
  6      φ                                           φ
  7   g: B(x) <- 3          p: goto X                r <- q <= 8
  8      h <- B(y)          o: C(x) <- 7          s: if( r ) jump
  9      φ                     -done-                φ
 10      i <- h <= 4                              u: goto Y             w: goto Z
 11   j: if( i ) jump                             t: C(x) <- 9          v: C(y) <- 10
 12      φ                                           -done-                -done-
 13   l: goto V             n: goto W
 14   k: C(x) <- 5          m: C(y) <- 6
 15      -done-                -done-

Figure 2.3. Optimum execution schedule for short instruction pipeline.

Note that a two cycle load instruction, such as d, can be overlapped with
another unrelated instruction, b. The load instruction h, however, is followed only by
instructions that depend directly or indirectly on the result of the load; thus no
instruction can be overlapped with h and a no-operation instruction, φ, must be inserted.
The RISC-1 microprocessor uses delayed branches with length one. In general, a
delayed branch with length n means that the n instructions following a branch are
always executed regardless of whether the branch is taken[13]. We call the n
instructions following a branch the delayed part of that branch. Thus instruction e is in the
delayed part of branch c. Similarly, instruction k following the unconditional branch
l is executed prior to the actual transfer of control.

The code sequence shown in figure 2.3 has been optimized to take maximum
advantage of concurrency in the pipeline. The load instruction d has been moved up to
fill the delay between the load instruction a and the dependent comparison instruction
b. Similarly, the comparison instruction e has been moved up into the delayed part of
branch c, and the store instructions k, m, o, t, and v have been moved down into the
delayed part of the logically succeeding unconditional branches.

In general, the execution time through a decision tree is dependent on which path
is taken through the tree. If the probability of taking path i is p_i and the execution
time of path i is t_i, then one measure of performance is the expected execution time

    E[T] = \sum_{i=1}^{M} p_i t_i

where M is the total number of paths through the decision tree. For figure 2.3, if each
path is equally likely, the expected execution time in cycles on a RISC-1 microprocessor
is

    E[T_{RISC}] = 0.2 (15 + 15 + 9 + 12 + 12) = 12.6 (clocks)
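
The expected execution time is a probability-weighted sum over paths, which can be checked with a few lines of Python. This is only an illustrative sketch: the function name `expected_time` and the list `risc_path_times` are assumptions introduced here, and the path times are the completion times read from figure 2.3.

```python
def expected_time(path_times, probabilities=None):
    """Expected execution time: sum over all paths of p_i * t_i."""
    if probabilities is None:
        # Assume equally likely paths, as the text does for figure 2.3.
        probabilities = [1.0 / len(path_times)] * len(path_times)
    return sum(p * t for p, t in zip(probabilities, path_times))

# Completion times (in clocks) of the five paths through the decision tree.
risc_path_times = [15, 15, 9, 12, 12]
e_t_risc = expected_time(risc_path_times)
print(round(e_t_risc, 1))
```

Passing an explicit `probabilities` list would model a tree whose paths are not equally likely.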


For this example, relatively high utilization (few φ cycles) is achieved when the
machine has a very short pipeline. On the other hand, short pipelines offer little
concurrency, and thus can achieve only relatively low performance. In order to quantify
achieved performance, a lower bound on the expected execution time can be derived by
assuming an infinite resource dataflow machine. Since data flow uses the assignment of
values to trigger dependent instructions, no interior jumps are needed. Thus
instructions c, f, j, and s are deleted. However, since the schedule is for one decision tree
only, the exterior gotos, l, n, p, u, and w, are retained. With infinite resources,
performance is limited only by data dependencies. Using RISC-1 timing, the execution
schedule for this example on an infinite resource dataflow machine is as follows.

    0  a, d, q
    1
    2  b, e, r
    3  g, o, p, t, u, v, w
    4  h
    5  end of paths 3, 4, and 5
    6  i
    7  k, l, m, n
    8
    9  end of paths 1 and 2

The lower bound on the expected execution time is therefore

    E[LB_{RISC}] = 0.2 (9 + 9 + 5 + 5 + 5) = 6.6 (clocks)
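
The dataflow schedule above is an as-soon-as-possible schedule: each instruction issues once all of its predecessors have completed. The following Python sketch recomputes it. The dependency graph `GRAPH` is an assumption reconstructed from the example (interior jumps removed, stores and gotos waiting on the comparisons that select their path), and all names are hypothetical.

```python
# RISC-1 timing as described in the text: loads and branches take two
# cycles, all other instructions take one.
LATENCY = {"load": 2, "goto": 2, "compare": 1, "store": 1}

# Assumed dependency graph: instruction -> (kind, predecessors).
GRAPH = {
    "a": ("load", []), "d": ("load", []), "q": ("load", []),
    "b": ("compare", ["a"]), "e": ("compare", ["d"]), "r": ("compare", ["q"]),
    "g": ("store", ["b", "e"]), "h": ("load", ["g"]), "i": ("compare", ["h"]),
    "k": ("store", ["i"]), "l": ("goto", ["i"]),
    "m": ("store", ["i"]), "n": ("goto", ["i"]),
    "o": ("store", ["b", "e"]), "p": ("goto", ["b", "e"]),
    "t": ("store", ["b", "r"]), "u": ("goto", ["b", "r"]),
    "v": ("store", ["b", "r"]), "w": ("goto", ["b", "r"]),
}

# Leaf instructions that end each path, keyed by the goto target.
PATH_LEAVES = {"V": ["k", "l"], "W": ["m", "n"], "X": ["o", "p"],
               "Y": ["t", "u"], "Z": ["v", "w"]}

def issue_time(node, memo):
    """Earliest issue time: every predecessor must have completed."""
    if node not in memo:
        preds = GRAPH[node][1]
        memo[node] = max(
            (issue_time(p, memo) + LATENCY[GRAPH[p][0]] for p in preds),
            default=0,
        )
    return memo[node]

def path_completion_times():
    """Completion time of each path: when its last leaf instruction ends."""
    memo = {}
    return {
        target: max(issue_time(n, memo) + LATENCY[GRAPH[n][0]] for n in leaves)
        for target, leaves in PATH_LEAVES.items()
    }

times = path_completion_times()
lower_bound = sum(times.values()) / len(times)  # equally likely paths
print(times, lower_bound)
```

Under these assumptions the paths ending in V and W complete at time 9 and the other three at time 5, matching the schedule in the text.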

The performance ratio is defined as

    E[LB_{RISC}] / E[T_{RISC}] = 6.6 / 12.6 = 52%

Using this metric, the RISC-1 microprocessor achieves a performance level that is only
about half of the performance level theoretically possible in an infinite resource
dataflow machine.


Suppose  that  in  an  attempt  to  increase  the  performance  of  the  machine,  the  clock 
speed  is  increased  by  a  factor  of  four,  causing  the  instruction  pipeline  to  become  four 
times  as  long.  There  is  now  a  possibility  of  four  times  the  concurrency,  but  the  load 
and  branch  instructions  take  eight  cycles  to  complete  while  all  other  instructions  take 
four  cycles.  Figure  2.4  shows  the  execution  schedule  for  the  same  example  decision 
tree.  The  expected  execution  time,  assuming  again  that  all  paths  are  equally  likely,  is 


    E[T_{4X}] = 0.2 (57 + 57 + 36 + 36 + 36) = 44.4 (4X clocks)

Therefore the speedup is

    4 E[T_{RISC}] / E[T_{4X}] = 50.4 / 44.4 = 1.135

and the performance ratio is

    4 E[LB_{RISC}] / E[T_{4X}] = 26.4 / 44.4 = 59%


This speedup is very small in spite of the fact that figure 2.4 has been hand coded for
optimal instruction overlap. Note that the speedup calculation for lengthening the
pipeline by a factor of four does not take into account the overhead represented by a clock
speedup of less than a factor of four due to the additional latching necessary to
implement a pipeline with finer granularity[14]. For this example even a small amount of
overhead, i.e. a clock speedup of less than 3.524, will cause the speedup to become less
than one.
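
The arithmetic behind these figures can be reproduced as follows. This is a sketch under the assumption that speedup is the clock speedup times the ratio of expected cycle counts, and that the performance ratio scales the dataflow lower bound by the same clock factor; the variable and function names are introduced here for illustration only.

```python
e_t_risc = 12.6   # expected time on the short pipeline, in 1X clocks
e_t_4x = 44.4     # expected time on the long pipeline, in 4X clocks
e_lb_risc = 6.6   # dataflow lower bound, in 1X clocks

def speedup(clock_ratio):
    """Speedup of the long pipeline when its clock is clock_ratio times faster."""
    return clock_ratio * e_t_risc / e_t_4x

# Fraction of the dataflow limit achieved with an ideal 4X clock.
performance_ratio = 4 * e_lb_risc / e_t_4x

# Clock speedup at which the long pipeline merely matches the short one.
break_even_clock_ratio = e_t_4x / e_t_risc

print(f"{speedup(4):.3f} {performance_ratio:.0%} {break_even_clock_ratio:.3f}")
```

Any clock speedup below the break-even value leaves the longer pipeline slower than the original machine on this example.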


For scalar code, simply increasing the pipeline length is rarely a viable approach to
achieving higher performance. The main problem is that, because basic blocks in scalar
code tend to be short and contain dependencies, there simply are not enough
independent instructions to make use of a high degree of concurrency.

Time    not taken             f or j taken          c taken               s taken
  0        a <- A(x)
  1        d <- A(y)
  2        q <- B(x)
  3-7      φ
  8        b <- a <= 1
  9        e <- d <= 2
 10        r <- q <= 8
 11        φ
 12     c: if( b ) jump
 13-19     φ
 20     f: if( e ) jump                             s: if( r ) jump
 21-27     φ                                           φ
 28     g: B(x) <- 3          p: goto X             u: goto Y             w: goto Z
 29        h <- B(y)          o: C(x) <- 7          t: C(x) <- 9          v: C(y) <- 10
 30-35     φ                     φ                     φ                     φ
 36        φ                     -done-                -done-                -done-
 37        i <- h <= 4
 38-40     φ
 41     j: if( i ) jump
 42-48     φ
 49     l: goto V             n: goto W
 50     k: C(x) <- 5          m: C(y) <- 6
 51-56     φ                     φ
 57        -done-                -done-

Figure 2.4. Optimum execution schedule for long instruction pipeline.

2.3. Architectural Support for Scalar Processing

In an instruction pipeline, no-operation cycles (φ) are inserted either by
hardware!  1]  or  by  software!*]  whenever  necessary  to  insure  that  data  and/or  control 
flow  dependencies  are  met.  Some  of  these  dependencies  are  "real”  in  the  sense  that  they 
are  inherent  in  the  program  as  expressed  by  the  programmer.  For  example,  in  figure  2.4 
the five φ's at times 3-7 are necessary because of the dependency through the register a
between the instructions at times 0 and 8 and the availability of only two instructions
to move up into this gap. On the other hand, some dependencies are an artifact of the
architecture. For example, the store instruction g at time 28 will be executed whenever
neither of the branches c and f is taken, i.e. whenever b and e are both false. The
values of b and e are actually known at time 13, but the store cannot be issued at that
time because it must wait for the branches to be completed. Early execution of store
instruction g is critical since the relatively slow load instruction h is dependent on g.
This dependency arises because the values of x and y are defined external to this
decision tree and hence in general the compiler cannot determine whether x ≠ y. Therefore
to guarantee correctness the compiler cannot allow h to be executed before g. Note
that instructions g and h can be issued in consecutive clock cycles if and only if g and
h are not dependent and if they reference different memory banks. Otherwise a
memory bank conflict occurs at time 29 and the issue of instruction h is delayed by
hardware until the conflict is resolved.

One way to avoid delaying a store instruction while waiting for branches to be
resolved is to provide a guard expression[15] on the store instruction. A guard
expression is a boolean valued expression. Whenever a guard expression evaluates to false, it
inhibits writing of the final result, thereby converting the instruction being guarded
into a no-operation. We represent a guarded store instruction by

    <store instruction> ? <guard expression>

where <guard expression> is a function of boolean results generated by previously
executed comparison instructions. For example, the store instruction g could be
changed to the guarded store

    g: B(x) <- 3 ? \bar{e} & \bar{b}
The guarded store, g, can now be moved into the delayed part of branch c at time 13,
followed by instruction h at time 14. Similarly, jump and goto instructions can be
guarded and moved up. Conditional branches can be converted to guarded jumps, and
other variables can be added to the guard expression. Note that loads, comparisons, and
computation instructions need not be guarded, provided that a sufficient number of
registers exists.
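
The effect of a guarded store can be illustrated with a small Python model. This is a sketch of the semantics only, not of the hardware: memory is modeled as a dictionary, and the function names `run_branching`, `guarded_store`, and `run_guarded` are hypothetical. It checks that issuing store g unconditionally with a guard leaves memory in the same state as the original branching code, for every outcome of the comparisons b and e.

```python
def run_branching(b, e, memory):
    """Reference behavior: store g runs only if neither branch c nor f is taken."""
    mem = dict(memory)
    if not b:          # branch c not taken
        if not e:      # branch f not taken
            mem["B(x)"] = 3
    return mem

def guarded_store(mem, address, value, guard):
    """Issue the store unconditionally; a false guard makes it a no-operation."""
    if guard:
        mem[address] = value

def run_guarded(b, e, memory):
    """Store g hoisted above the branches, protected by its guard expression."""
    mem = dict(memory)
    guarded_store(mem, "B(x)", 3, (not e) and (not b))
    return mem

initial = {"B(x)": 0}
for b in (False, True):
    for e in (False, True):
        assert run_branching(b, e, initial) == run_guarded(b, e, initial)
```

Because the guard is computed from comparison results rather than from completed branches, the store no longer has to wait for control flow to be resolved.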

By making use of guarded stores and guarded jumps, the decision tree shown in
figure 2.4 can be further improved to yield the schedule shown in figure 2.5. Note that
once the delayed part of a branch has been completed, subsequent guard expressions
need not test the value that determined the outcome of that branch. For example,
instruction o is executed if e & \bar{b} is true. However, since the delayed part of branch c is
completed at time 19, the compiler uses the fact that if control flows to instruction o,
then b = 0 so there is no need to include the \bar{b} term in the guard expression for
instruction o. Similarly, b = 1 is known if control flows to instructions t and v. Since the
delayed part of branch f is completed at time 22, no later instructions need use e in
their guard expressions.

The operation of the pipeline for the time interval 19 to 30 is shown in figure 2.6.
In this figure, instructions flow through the pipeline from top to bottom. Lower case
letters represent instructions from figure 2.5. Upper case letters with subscripts, such
as X_k, represent the k-th instruction after the label X of a goto instruction. A blank
entry is used to indicate a φ instruction.

Figure 2.6(a) shows the pipeline operation when path 3 of the decision tree is
taken. Referring to pipeline segment 8, primed instructions, such as c', indicate a guard
expression that evaluates to false. Each of these instructions is converted into a
no-operation by the hardware. Instructions in parentheses, such as (h), are fully executed.
Although the achieved speedup of 1.691 relative to a 1X pipeline is still much less than
the theoretical 4X speedup, it is relatively near the data flow limit, as shown by the
performance ratio

    4 E[LB_{RISC}] / E[T_{guard}] = 26.4 / 29.8 = 89%

Recall that this performance ratio for a 4X pipeline without guarded instructions is
only 59%.

The amount of hardware required to evaluate guard expressions depends on the
complexity of the expressions. We have found that expressions consisting of the AND
of true and complemented values of a few register bits are adequate for supporting fast
scalar processing. Thus very fast guard expression evaluation can be implemented
inexpensively.
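
One way to see why such guards are cheap is to model the boolean results as bits of a single word: an AND of true and complemented bits then reduces to one mask-and-compare. The sketch below is an illustration under that assumption; the bit assignment (bit 0 for b, bit 1 for e) and the names `make_guard` and `guard_passes` are hypothetical.

```python
def make_guard(true_bits=(), complemented_bits=()):
    """Encode an AND of true and complemented boolean bits as (mask, want)."""
    mask = 0
    want = 0
    for bit in true_bits:
        mask |= 1 << bit   # this bit is tested...
        want |= 1 << bit   # ...and must be 1
    for bit in complemented_bits:
        mask |= 1 << bit   # this bit is tested and must be 0
    return mask, want

def guard_passes(guard, boolean_bits):
    """A guard passes when every tested bit has its required value."""
    mask, want = guard
    return (boolean_bits & mask) == want

# Assumed bit assignment: bit 0 holds b, bit 1 holds e.
B, E = 0, 1
guard_g = make_guard(complemented_bits=[E, B])             # e false AND b false
guard_o = make_guard(true_bits=[E], complemented_bits=[B])  # e true AND b false

print(guard_passes(guard_g, 0b00), guard_passes(guard_o, 0b10))
```

An empty guard has a zero mask and always passes, which corresponds to an unguarded instruction.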

To exploit the guarded store and guarded jump instructions discussed above, it is
necessary to perform extensive code rearrangement. Constraints on code rearrangement
arise from data dependencies between instructions, hence it is critical that artificial
dependencies are eliminated whenever possible. An important class of artificial
dependencies arises due to register reuse. In the following example, no parallelism can be
exploited in the code sequence on the left because the instructions form a dependency
chain.

    a <- x + y          a  <- x + y
    b <- a + z          b  <- a + z
    a <- u + v          a' <- u + v
    c <- a + w          c  <- a' + w


The code sequence on the right, however, forms two independent dependency chains and
thus execution of one chain can be overlapped with the other. The improvement in
parallelism was achieved by renaming the second assignment of register a so as to avoid
reuse. The technique of register renaming to eliminate unnecessary dependencies is well
known[16]. When applied to a tree of basic blocks, every temporary register can be
renamed so as to be assigned exactly once, the so called single assignment property.
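
For straight-line code, renaming to single assignment form can be sketched in a few lines of Python. This is a simplified illustration, not the compiler's actual algorithm: instructions are modeled as (destination, source, source) triples, each reassignment of a register appends a prime to its name, and the function name is an assumption.

```python
def rename_single_assignment(code):
    """Rename destinations so that each register is assigned exactly once."""
    assignments = {}  # register -> number of assignments seen so far
    current = {}      # register -> name of its most recent renamed version
    renamed = []
    for dest, src1, src2 in code:
        # Sources refer to the latest version of each register.
        s1 = current.get(src1, src1)
        s2 = current.get(src2, src2)
        count = assignments.get(dest, 0)
        new_dest = dest + "'" * count  # first assignment keeps its name
        assignments[dest] = count + 1
        current[dest] = new_dest
        renamed.append((new_dest, s1, s2))
    return renamed

# The left-hand sequence from the example: register a is reused.
code = [("a", "x", "y"), ("b", "a", "z"), ("a", "u", "v"), ("c", "a", "w")]
for dest, s1, s2 in rename_single_assignment(code):
    print(f"{dest} <- {s1} + {s2}")
```

Applied to the left-hand sequence, the sketch yields the right-hand sequence, in which the second pair of instructions no longer depends on the first.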

The  use  of  single  assignment  temporaries  gives  the  compiler  maximum  flexibility 
in  reordering  code  within  a  decision  tree  so  as  to  utilize  concurrent  resources  efficiently. 
All code shown in the examples has this single assignment property. Note that in
actuality some register reuse can be accommodated without performance degradation. Once
a  code  sequence  has  been  generated,  a  second  pass  can  be  made  over  the  generated  code 
to  locate  disjoint  uses  of  registers  and  map  them  into  the  same  physical  register.  No 
performance  is  lost  by  this  mapping  and  register  requirements  are  reduced. 

Architectural  support  for  register  renaming  to  increase  parallelism  simply 
involves providing a sufficiently large number of registers. Extensive code
rearrangement, however, poses a more serious problem. When code is moved from after a branch
to  before  the  branch,  some  instructions  from  the  conditionally  taken  and  fall-through 
paths  of  the  branch  become  unconditionally  executed.  In  this  case  a  spurious  exception 
condition  can  occur  in  the  rearranged  code  due  to  the  execution  of  an  instruction  that 
would  not  have  been  executed  in  a  serial  machine. 
One possible solution is to encode the exception condition within the result
register[17], possibly by extending the length of the register. Exception conditions are
propagated through subsequent computations, but the actual signaling of exception
conditions is deferred until an attempt is made to store the content of a register
containing an exception code. With rearrangement, the set of store instructions not inhibited
by guard expressions is exactly the same as the set of stores produced by a serial
execution of the program without rearrangement. Thus the signalled exceptions are the
same. The same encoding technique as for arithmetic exceptions such as divide by zero
can be used for illegal memory references so as to allow actions like prefetching past the
end of an array. However, this technique cannot easily be extended to page fault
exceptions, since they may eventually have to be serviced.
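
The deferred signaling scheme can be modeled in Python as follows. This is a sketch of the idea only: the `Value` class stands in for a register extended with an exception code, and all class and function names are hypothetical rather than taken from the proposed architecture.

```python
class Value:
    """A register value that may carry a deferred exception code."""
    def __init__(self, number=None, exception=None):
        self.number = number
        self.exception = exception

class DeferredException(Exception):
    """Raised only when a value carrying an exception code is actually stored."""

def add(x, y):
    # An exception code in either operand propagates to the result.
    exc = x.exception or y.exception
    return Value(exception=exc) if exc else Value(x.number + y.number)

def divide(x, y):
    exc = x.exception or y.exception
    if exc:
        return Value(exception=exc)
    if y.number == 0:
        # Record the exception in the result instead of signaling it now.
        return Value(exception="divide by zero")
    return Value(x.number / y.number)

def guarded_store(memory, address, value, guard):
    if not guard:
        return  # inhibited store: any exception code is silently dropped
    if value.exception is not None:
        raise DeferredException(value.exception)
    memory[address] = value.number

memory = {}
speculative = add(divide(Value(1), Value(0)), Value(2))
guarded_store(memory, "C(x)", speculative, guard=False)  # no exception raised
```

A divide executed speculatively on a path that is not taken therefore never signals, because its result only ever reaches stores whose guards are false.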

2.4. The Decision Tree Scheduling Technique

In the previous section we proposed the guarded store and guarded jump
architectural features. These architecture features can significantly improve scalar code
performance. Efficient use of these features, however, requires that the compiler perform
extensive code rearrangement. This section describes a heuristic code generation
technique that performs the necessary code rearrangement.

The main objective of the code generator is to rearrange, i.e. schedule, the
instructions in a decision tree so as to minimize the expected execution time through the
decision tree. The technique presented here was used to produce the schedule shown
previously in figure 2.5. A convenient representation of the program for the purpose of
instruction scheduling is the dependency graph[16]. Figure 2.7 shows the dependency
graph for the ongoing decision tree example. Instructions are shown as nodes in the
graph, labeled with either the result register name or the explicit instruction label. Arcs
in the graph represent data or control dependencies between instructions. Within each
node, the number on the left hand side of the center row gives the earliest time that the
node can be issued. The number on the right hand side gives the execution delay of the
node. Note that jumps in the interior of the tree (c, f, j, s) have
execution delay +1 to allow overlapped jumps, while leaf jumps (l, n, p, u, w) have
delay +8 in our model since this tree must be completed before the next tree begins.
The real number at the bottom of the node is the scheduling priority, to be discussed
later.

The problem of finding a minimum delay schedule for a set of dependent tasks is known to
be NP-hard[18]. However, it is well known that list scheduling techniques[19] produce
good schedules in practice. We have developed a decision tree scheduling (DTS) technique
based on an extension of list scheduling. A procedure implementing the DTS technique is
shown in figure 2.8. This procedure is initially invoked with G equal to the entire
dependency graph and P equal to the set of all paths through the decision tree from the
root to a leaf node. On the initial call, nothing is deleted from the graph in step 1
since every node in the graph lies on at least one path


procedure schedule( G, P )

G: Dependency graph representing subprogram to be scheduled
P: Set of paths through subtree to be scheduled

begin

1. Delete from G those nodes and arcs not on paths in P

2. Return if G is empty

3. Schedule nodes until potential transfer of control flow

4. schedule( G, {jump taken paths} )

5. schedule( G, {jump not taken paths} )

end.


Figure  2.8.  Decision  tree  scheduling  procedure. 


through the tree. Step 3 schedules code in priority order subject to dependency
constraints. Code scheduling continues until n cycles after the first interior branch has
been scheduled, where n is the delay of the branch. At this time two logically
independent subtasks are created to handle the two possible branch outcomes and
"schedule" is called recursively.

The  first  subtask  is  initiated  with  a  copy  of  the  dependency  graph  along  with  the 
subset  of  paths  through  the  decision  tree  that  pertains  if  the  jump  is  taken  (step  4). 
Similarly,  the  second  subtask  is  initiated  with  a  copy  of  the  dependency  graph  and  the 
subset  of  paths  that  pertain  if  the  jump  is  not  taken  (step  5). 

Each subtask schedules code until n cycles after the next interior jump that belongs to
its own subset of paths. Note that this jump may have been scheduled by the parent task
in the delayed part of some earlier jump. Referring to figure 2.5, the subtask handling
the code sequence for paths {4, 5} beginning with instruction t finds previously
scheduled jump s to belong to paths {4, 5} but not jump f, since f belongs to paths
{1, 2, 3}. Therefore the subtask for {4, 5} would stop code scheduling at time
x_s + n = 16 + 8 = 24, where x_s is the issue time of instruction s, and recursively
divide into two subtasks to complete paths {4} and {5} independently. Note that exterior
jumps (gotos) do not terminate code scheduling in a task. The recursive division
continues to the leaves of the decision tree. Application of this code scheduling
procedure to the dependency graph shown in figure 2.7 was used to produce the code shown
in figure 2.5.

The  quality  of  code  generated  by  the  DTS  technique  is  dependent  on  the  heuristic 
used  to  compute  the  node  priorities.  Intuitively,  the  node  priorities  should  satisfy  the 
following  properties: 


(1)  A  node  on  a  high  probability  path  through  the  decision  tree  should  be  given  higher 
priority  than  a  node  on  a  low  probability  path.  Since  a  node  can  be  on  multiple 
paths,  the  priority  of  a  node  should  depend  on  the  sum  of  the  path  probabilities. 

(2)  On  a  given  path,  a  node  near  the  top  of  the  critical  path  through  the  dependency 
graph  should  be  given  higher  priority  than  a  node  near  the  bottom  of  the  critical 
path.  Also,  a  node  not  on  the  critical  path  should  be  given  lower  priority  than  a 
node  on  the  critical  path[20]. 

Property two can be quantified as follows: Focus on a single path i through the decision
tree. Calculate, for each node on that path, the earliest execution time based on data
dependencies. Define the path length l_i to be the earliest completion time of the
terminal goto instruction of path i. Sweep backward through the graph and calculate, for
each node j, the latest time (latest_j) at which the node must be issued in order for
the path to be completed within the minimum time l_i. Nodes on which the terminal goto
instruction does not depend have a latest issue time of l_i minus the execution time of
that node. Define the urgency of a node j on path i through the decision tree to be

latest. 

Figures 2.9 and 2.10 show the earliest issue time, the latest issue time, and the urgency
for each node on each path through the decision tree.
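The forward and backward sweeps described above can be sketched in C as follows. This is only a minimal illustration under stated assumptions, not the implementation used in this work: the names `struct node`, `path_times`, `MAX_PREDS`, and `MAX_NODES` are ours, the nodes of one path are assumed to be listed in topological order, every delay is assumed to be at least one cycle, and a result is assumed to be available `delay` cycles after issue.

```c
#include <assert.h>

#define MAX_PREDS 4    /* illustrative limit on predecessors per node */
#define MAX_NODES 64   /* illustrative limit on nodes per path */

/* One instruction on a path. Nodes are stored in topological order, so
 * every predecessor index is smaller than the node's own index. */
struct node {
    int delay;               /* execution delay in cycles */
    int npreds;              /* number of entries used in preds[] */
    int preds[MAX_PREDS];    /* indices of the nodes this one depends on */
};

/* Fills earliest[] and latest[] for the n nodes of one path whose terminal
 * goto is node `term`, and returns the path length. */
int path_times(const struct node *g, int n, int term,
               int *earliest, int *latest)
{
    int needed[MAX_NODES] = {0};   /* 1 if the terminal goto depends on it */
    int length, j, k;

    assert(n > 0 && n <= MAX_NODES && term >= 0 && term < n);

    /* Forward sweep: a node can issue once all predecessors have finished. */
    for (j = 0; j < n; j++) {
        earliest[j] = 0;
        for (k = 0; k < g[j].npreds; k++) {
            int p = g[j].preds[k];
            assert(p >= 0 && p < j);
            if (earliest[p] + g[p].delay > earliest[j])
                earliest[j] = earliest[p] + g[p].delay;
        }
    }
    length = earliest[term] + g[term].delay;

    /* Default for nodes the terminal goto does not depend on:
     * path length minus the node's own execution time. */
    for (j = 0; j < n; j++)
        latest[j] = length - g[j].delay;

    /* Backward sweep from the terminal goto: a predecessor must issue early
     * enough for each dependent node to meet its own latest issue time. */
    needed[term] = 1;
    for (j = term; j >= 0; j--) {
        if (!needed[j])
            continue;
        for (k = 0; k < g[j].npreds; k++) {
            int p = g[j].preds[k];
            needed[p] = 1;
            if (latest[j] - g[p].delay < latest[p])
                latest[p] = latest[j] - g[p].delay;
        }
    }
    return length;
}
```

For a hypothetical four-node path (a two-cycle load feeding a one-cycle operation feeding the goto, plus one independent one-cycle node), the sketch gives a path length of 4. The three dependent nodes have equal earliest and latest issue times, while the independent node may be issued as late as time 3.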

Combining the urgency metric with property one gives the heuristic priority function. For
each node j, the list scheduling priority w_j is given by

$$ w_j = \sum_{i=1}^{M} p_i \cdot \mathrm{urgency}_{ij} $$

where M is the number of paths through the decision tree, p_i is the probability of
taking path i, and urgency_ij is the urgency of node j on path i. Application of this
heuristic function to the urgency values shown in figures 2.9 and 2.10, where each p_i is
assumed to be 0.2 in this example, was used to provide the priority values shown in
figure 2.7.

A significant advantage of the DTS technique is the sensitivity of the heuristic to the
values of the path probabilities. For example, if instead of being equal, the path
probabilities were (1/8, 1/8, 1/8, 1/2, 1/8), then the new node priorities would be as
shown in table 2.1. Based on these new priority values, the DTS technique would produce
the new schedule shown in figure 2.11. Note that path 4 (the high probability path) has
been shortened from 26 to 22 cycles at the expense of increasing the length of paths 1,
2, and 3. Table 2.2 shows the expected execution time and speedup relative to the RISC-1
processor for each of the schedules under the asymmetric path probability assumption. The
RESCHED column uses the schedule in figure 2.11 and reflects the advantage of
rescheduling when path probabilities change.


Table 2.1. Node priorities for path probabilities (1/8, 1/8, 1/8, 1/2, 1/8).

Table 2.2. Expected execution time and speedup for asymmetric probabilities.

ties. Alternatively, the DTS technique can be viewed as an extension of the IF-tree
technique proposed by Davis[22]. An IF-tree is a binary decision tree transformed to use
a large multiway branch. Since a multiway branch cannot be executed until all dependent
conditions have been evaluated, exits from the decision tree cannot be made until the
last condition has been evaluated. In contrast, the DTS technique takes advantage of
early exits out of the decision tree whenever possible.

The DTS technique can be extended to architectures that employ parallel instruction
pipelines and a horizontal microcode-like instruction format so as to permit issuing
multiple instructions per clock cycle. For example, instead of increasing the RISC-1 clock
speed by a factor of four and hence multiplying the pipeline length by four as shown in
figure 2.5, the same level of concurrency can be achieved by multiplying the pipeline
length by two and then duplicating the pipeline. Using the same node priorities as shown
in figure 2.7, the new schedule with up to two instructions issued per cycle is shown in
figure 2.12. In this figure the instructions are shown by their label and guard
expression. The top of each box represents the branch target line used in previous
figures. When the compiler generates two guarded jumps in the same clock cycle, they must
be mutually exclusive so that the hardware implementation remains simple. Assuming equal
path probabilities, the expected execution time, speedup, and performance ratio are

This schedule with a 2x clock achieves a speedup slightly better than the 1.691 speedup
obtained by the single pipeline with a 4x clock.


Another configuration with the same concurrency is 4 RISC-1 pipelines using a 1x clock.
The schedule for this configuration is shown in figure 2.13. In this figure the guard
expressions have been omitted to save space. Again assuming equal path probabilities, the
expected execution time, speedup, and performance ratio are

$$ E[T] = 0.2\,(9 + 9 + 6 + 6 + 6) = 7.2 \quad \text{(1x clock)} $$
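The expected execution time used throughout this section is a probability-weighted sum of path lengths. A minimal C sketch of that calculation follows; the function name `expected_time` and its parameter names are ours and are chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Expected execution time of a decision tree: the sum over all paths of
 * (path probability) x (path length in cycles). */
double expected_time(const double *prob, const int *cycles, size_t npaths)
{
    double e = 0.0;
    size_t i;

    for (i = 0; i < npaths; i++) {
        assert(prob[i] >= 0.0 && prob[i] <= 1.0);
        e += prob[i] * (double)cycles[i];
    }
    return e;
}
```

Called with five probabilities of 0.2 and path lengths of 9, 9, 6, 6, and 6 cycles, it reproduces the 7.2 cycles quoted above, up to floating-point rounding.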
The ability of the DTS heuristic to utilize both pipelining and parallelism efficiently
significantly increases the flexibility of the machine organization and allows the machine
designer to trade off pipelining with parallelism for greater cost-effectiveness.

Figure 2.13. Execution schedule for four instructions per cycle.

2.5. Performance Evaluation

We  have  constructed  a  compiler  to  evaluate  the  performance  of  the  DTS  technique 
and  the  guarded  store  and  jump  architectural  features.  This  compiler  accepts  a  subset 
of  the  language  C. 

The performance evaluation is based on a pipelined uniprocessor model derived from the
scalar portion of the Cray-1. This baseline uniprocessor has the instruction execution
timing characteristics of the Cray-1 computer[23, 24] with branches taking a constant 14
cycles. The Cray-1 branch time is actually 2, 5, or 14 clocks for the cases that the
branch is not taken, taken with branch target in instruction buffer, or taken with branch
target in main memory, respectively. The constant 14 clock assumption simplifies the
baseline uniprocessor, but makes its performance somewhat lower than the performance of
the actual Cray-1 on scalar code.

In this section performance is expressed as speedup relative to this baseline
uniprocessor. The target machine model consists of one or more pipelined processing
elements. The processing elements are controlled by a horizontal instruction word with an
instruction field for each processing element. The timing characteristics of each
processing element are the same as those of the baseline uniprocessor.

In addition to the normal operation of code generation, our compiler also performs
selective code replication. Use of replication was motivated by the observation that
pipeline utilization improves with larger decision trees. Natural decision trees in most
programs tend to be very small because every basic block that has more than one
predecessor block becomes the root of a distinct decision tree. For example, the program
fragment shown in figure 2.14(a) produces three decision trees. The first tree contains
blocks c1, S1, and S2, the second tree contains blocks c2, S3, and S4, and the third tree
contains the block S6.

By replicating the second if-statement and statement S6, a single much larger decision
tree can be produced as shown in figure 2.14(b). This tree consists of the 7 blocks (c1),
(S1, c2), (S3, S6), (S4, S6), (S2, c2), (S3, S6), and (S4, S6).


    if (c1)                    if (c1) {
        S1;                        S1;
    else                           if (c2) {
        S2;                            S3; S6;
    if (c2)                        } else {
        S3;                            S4; S6;
    else                           }
        S4;                    } else {
    S6;                            S2;
                                   if (c2) {
                                       S3; S6;
                                   } else {
                                       S4; S6;
                                   }
                               }

      (a)                          (b)

Figure 2.14. Use of code replication to produce larger decision trees.

In our compiler, code replication is controlled by a parameter ε. As long as a path of a
decision tree has probability greater than ε, an attempt is made to replicate code
further along that path until conditional branches cause the path probability to fall
below ε. This technique of weighted code replication is advantageous in that high
probability subtrees of a decision tree are made deeper and larger by code replication
while low probability subtrees are kept small.
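The threshold rule can be illustrated with a short C sketch. It is not the compiler's actual code: the function `replication_depth` and its parameters are hypothetical, and the sketch follows a single path only, multiplying in the probability of each conditional branch outcome along that path and counting how many branches are passed before the path probability is no longer greater than ε.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the number of conditional branches along one path that code is
 * replicated past. branch_prob[d] is the (assumed known) probability of
 * following this path at its d-th conditional branch. */
size_t replication_depth(double epsilon, const double *branch_prob,
                         size_t nbranches)
{
    double path_prob = 1.0;   /* probability of reaching the current point */
    size_t depth = 0;

    assert(epsilon > 0.0 && epsilon <= 1.0);

    /* Keep replicating while the path is still more probable than epsilon. */
    while (depth < nbranches && path_prob > epsilon) {
        path_prob *= branch_prob[depth];
        depth++;
    }
    return depth;
}
```

With ε = 1.0 the sketch performs no replication, since no path has probability greater than 1. With a made-up branch probability of 0.994 at every branch, ε = 0.997 replicates past one branch and ε = 0.98 replicates past four.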

Table 2.3 shows the achieved performance for a binary search program with the list to be
searched represented as a balanced linked binary tree with integer keys. In this table, p
is the number of processing elements, i.e. the number of instructions that can be issued
in parallel each clock cycle, and ε is the level of code replication. Each successive
column represents code replication past one more conditional branch.

Table 2.3 shows that a relatively good speedup of 2.392 can be achieved by the DTS
technique with guarded operations on a uniprocessor (p = 1) and no code replication
(ε = 1.0). Additional speedup can be achieved either by code replication (going right) or
by increasing the level of parallelism in the machine (going down). Greater speedup can
be achieved by combining both code replication and increased parallelism.
Table 2.3. Performance of linked binary tree search.

Table 2.4. Performance of quicksort algorithm.

Table 2.5. Performance of vmsched.c from the UNIX kernel.

reducing ε has even higher relative payoff. Once ε is reasonably low, increasing p has a
much greater effect.

Both of the above examples are very small program fragments. To test the viability of our
techniques on more difficult examples, we evaluated a number of program modules from the
Berkeley UNIX kernel. Table 2.5 shows the achieved performance for the virtual memory
scheduler from the UNIX kernel. Less speedup was achieved relative to the previous
examples because this program module contains many procedure calls to externally defined
procedures. An external procedure call always terminates a path in the decision tree
since code replication cannot proceed without knowledge of the called procedure. Hence
the full power of code replication could not be applied in this example. Nevertheless,
even this example demonstrates that our approach is capable of delivering significant
speedup on branch-intensive scalar code.

We have found that our DTS compilation technique and the guarded store and jump
architectural features are very effective with parallel-pipeline hardware for
vectorizable code as well as scalar code. Table 2.6 shows the achieved performance for
the first loop from the Lawrence Livermore Loops benchmarks[25]. On a uniprocessor with
the loop unrolled once via code replication (p = 1, ε = 0.997), the result is comparable
to hardware speedup techniques for scalar processors[26]. Using three processing elements
and unrolling the loop four times (p = 3, ε = 0.98), DTS produces a speedup of 9.021 via
a schedule whose average execution time is 2.62 cycles per loop iteration. In contrast,
since the Cray-1 has only a single floating-point multiply pipeline and this example uses
3 multiplications per loop iteration, the Cray-1 in vector mode requires no fewer than
3.00 cycles per loop iteration, i.e. a maximum speedup of 7.883 relative to its scalar
mode performance.
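As a rough consistency check (our own arithmetic, using only the figures quoted above), multiplying each speedup by its cycles per iteration gives nearly the same baseline scalar time per loop iteration:

$$ 9.021 \times 2.62 \approx 23.64, \qquad 7.883 \times 3.00 \approx 23.65 $$

Both speedups therefore appear to be measured against a baseline of roughly 23.6 cycles per iteration.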


CHAPTER  3 


SCHEDULING  SIMPLE  LOOPS  FOR  OPTIMAL  THROUGHPUT 


3.1.  Introduction 

In a conventional programming environment many programs spend a large fraction of their
execution time in looping constructs. Therefore the optimization of program loops in
order to speed up their execution time is of paramount importance in a high performance
computer system.

Although loops can be viewed as scalar code and scheduled to achieve higher performance
using the decision tree scheduling (DTS) technique described in chapter 2, in general the
DTS technique cannot deliver maximum loop performance since the DTS technique has the
restriction that a decision tree complete execution before another decision tree can
begin. This strategy allows distinct trees to be scheduled independently and was deemed
necessary in order to reduce the complexity of the scheduling problem to a manageable
level. A disadvantage of scheduling trees independently is that performance is
compromised during the transition from one tree to another. Therefore loop performance is
degraded if tree transitions occur while the loop is being executed.

By using code replication to cause loop unrolling, the number of tree transitions can be
reduced since multiple iterations of the loop can be executed by a single decision tree.
However, tree transitions can never be totally eliminated unless the loop is completely
unrolled. Complete loop unrolling is rarely feasible since loop iteration limits are
frequently data-dependent and/or the number of iterations is so large that complete
unrolling is impractical. Therefore in general some performance degradation is inevitable
when loops are scheduled using the DTS technique.


The structure of general loops involving nested conditional statements and procedure
calls is quite complex, and perhaps in practice it is best to handle them using the DTS
technique in spite of the fact that tree transitions lead to suboptimal performance.
However, simple loops whose bodies consist solely of assignment statements have a very
simple and regular structure. This structural regularity can be exploited to unroll the
loop logically and completely without actually doing so. This is the basis for the simple
loop scheduling (SLS) technique proposed in this chapter.

The DTS technique is a general technique that is applicable to any program construct. The
SLS technique, on the other hand, is applicable only to a restricted class of loops. The
advantage of the SLS technique is that it produces schedules that are throughput optimal,
i.e. optimal performance is maintained as long as the loop continues to iterate.
Suboptimal performance occurs only when the loop starts and when it terminates.

The  importance  of  high-speed  loop  execution  is  well  known.  Since  loop  speedup 
techniques  depend  on  the  machine  model  chosen,  we  begin  this  chapter  with  a  brief 
review  of  several  well-known  machine  models.  Following  that,  the  SLS  technique  is 
developed  in  several  stages,  beginning  with  simple  cases. 

3.2.  Architectures  and  Loop  Performance 

Consider the simple loop written in the language C shown in figure 3.1. This program
fragment is representative of many nonnumeric programs in that simple loops are
frequently used to traverse linked data structures. The C notation

    while (<assignment>) {...}

means perform the <assignment> statement as specified and retain the value that was
assigned. Then if the retained value is nonzero (i.e. if the pointer is valid), initiate

    struct vertex {                    /* Graph vertex descriptor. */
                                       /* Other fields in vertex descriptor. */
        struct arc **arc;              /* Pointer to list of outbound arc descriptors. */
        int data;                      /* A data field of interest. */
    };

    struct arc {                       /* Graph arc descriptor. */
                                       /* Other fields in arc descriptor. */
        struct vertex *node;           /* Pointer to destination vertex. */
    };

    int T[ ];                          /* A list of indices into the following array. */
    struct {                           /* List of records in an array. */
        int index;                     /* A value used to index into the array T. */
        int value;                     /* An interesting data value. */
    } R[ ];

    register struct vertex *c;         /* A cursor moving through the graph. */
    register int i = 0;                /* A corresponding index into the array T. */
    register int g;                    /* A temporary used to index into R. */
    register int k = 0;                /* An accumulator of interesting values. */

    while (c = c->arc[1]->node) {      /* While more vertices do: */
        g = T[c->data + i];            /* Calculate an index into R. */
        i = R[g].index;                /* Acquire the next corresponding index. */
        k += R[g].value;               /* Accumulate another interesting value. */
    }

Figure  3.1.  Example  of  simple  loop  in  source  language. 


another iteration of the loop body. If the retained value is zero, the loop terminates.
Note that the <assignment> statement is reexecuted at the beginning of each iteration of
the loop.
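To make the loop semantics concrete, the fragment of figure 3.1 can be wrapped in a function as in the following sketch. The function name `accumulate`, the type name `struct record`, and the choice to pass T and R as parameters are ours. The explicit comparison with NULL is equivalent to the bare assignment form and is used only to avoid compiler warnings. The sketch assumes every visited vertex has a valid second arc descriptor whose node pointer is null at the end of the chain.

```c
#include <assert.h>
#include <stddef.h>

struct arc;

struct vertex {
    struct arc **arc;       /* list of outbound arc descriptors */
    int data;               /* data field of interest */
};

struct arc {
    struct vertex *node;    /* destination vertex, or NULL at the end */
};

struct record {
    int index;              /* next value used to index into T */
    int value;              /* value to be accumulated */
};

/* Follows the second outbound arc from each vertex, starting at c, and
 * accumulates R[g].value exactly as the loop body of figure 3.1 does. */
int accumulate(struct vertex *c, const int *T, const struct record *R)
{
    int i = 0;              /* recirculating index into T */
    int g;                  /* temporary index into R */
    int k = 0;              /* accumulator */

    /* The assignment is reexecuted, and its value tested, on every pass. */
    while ((c = c->arc[1]->node) != NULL) {
        g = T[c->data + i];
        i = R[g].index;
        k += R[g].value;
    }
    return k;
}
```

Note that the starting vertex itself contributes nothing: the cursor is advanced before the first execution of the loop body, so only the vertices reached through the arcs are processed.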

The  data  structures  and  operational  characteristics  of  this  simple  loop  are  shown 
in  figure  3.2.  The  top  part  of  the  figure  shows  a  graph  data  structure.  The  graph  is 
traversed  by  following  the  second  outbound  arc  from  each  vertex  (array  indices  begin 


with  0  in  the  language  C).  The  bottom  part  of  the  figure  shows  a  pair  of  related  tables. 


next  index  into  table  T.  The  next  entry  in  R  is  the  value  field  to  be  accumulated  in  the 
variable  k .  The  final  value  of  k  is  assumed  to  be  used  eventually  outside  of  the  loop. 

An assembly level representation of this simple loop is shown in figure 3.3. We have
chosen to use a load/store architecture[1, 5, 6, 7, 8, 9] for reasons discussed in
chapter 2. From this low level representation of the program we can derive the dependency
graph shown in figure 3.4. Instructions are shown as nodes in the graph, labeled with
either the result register name or the explicit instruction label. Solid arcs represent
data dependencies between nodes; downward arcs specify dependencies within a single
iteration of the loop while upward arcs specify dependencies from one iteration to the
next. Dashed arcs represent control dependencies resulting from conditional branches. The
dependency graph is an appropriate representation of a program loop for the purpose of
readily viewing the constraints on code scheduling. In future examples we shall omit the
lengthy process of specifying program fragments and simply use dependency graphs.


       a ← arc(c)            Load pointer to arc pointer list.
       b ← 1(a)              Load pointer to second arc descriptor.
       c ← node(b)           Load pointer to destination of arc.
    d: if (c = 0) exit       Terminate loop if no more vertices.
       e ← data(c)           Load data field from vertex descriptor.
       f ← e + i             Use data to offset recirculating index.
       g ← T(f)              Load index to array of records.
       h ← g << 2            Shift index to form offset.
       i ← R.index(h)        Load new value of recirculating index.
       j ← R.value(h)        Load new data value of interest.
       k ← k + j             Accumulate new data value.
    l: goto loop             Start another iteration.


Figure  3.3.  Assembly  level  representation  of  simple  loop. 


Such a loop, executed on a wide variety of architectures, will achieve varying levels of
performance. In this section we compare the advantages and disadvantages of several
architectures for this kind of loop.

3.2.1.  Scalar  Architectures 

Consider  a  pipelined  scalar  architecture  such  as  the  RISC-1  microprocessor^]. 
This  microprocessor  issues  one  instruction  per  cycle,  and  performs  a  load  or  delayed 
branch  instruction  in  two  cycles  and  all  other  instructions  in  one  cycle.  Figure  3.5 
shows a table of instruction issue times for one iteration of the loop. Entries in the
"INSTRUCTION ISSUED" column contain either the instruction being issued, or φ if a
data dependency prevents the next instruction from being issued in that clock period.


    TIME    INSTRUCTION ISSUED
      0     a ← arc(c)
      1     φ
      2     b ← 1(a)
      3     φ
      4     c ← node(b)
      5     φ
      6     d: if (c = 0) exit
      7     e ← data(c)
      8     φ
      9     f ← e + i
     10     g ← T(f)
     11     l: goto loop
     12     h ← g << 2
     13     j ← R.value(h)
     14     i ← R.index(h)
     15     k ← k + j

Figure 3.5. Scalar processor schedule for one iteration of the loop.
Note that the schedule shown in figure 3.5 has been optimized to take maximum advantage
of instruction overlap. To facilitate instruction overlap further, we have assumed that
delayed branches can be delayed by more than two cycles if desired. This allows branch
instruction l to be issued at time 11, which otherwise would have been a φ cycle. An
unshown parameter within the branch instruction increases the delay and causes the next
iteration to begin at time 16.

As figure 3.5 shows, each iteration of the loop takes 16 clock cycles. Therefore the time
for a scalar RISC-1 microprocessor to complete n iterations is

$$ T_{scalar} = 16n $$

We shall use this time as the basis for comparison with other architectures.
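The issue times of figure 3.5 can be checked with a small in-order issue simulation, sketched below in C. The names `struct instr`, `loop_body`, and `issue_in_order` are ours. The sketch assumes one instruction is issued per cycle in the listed order, that a result becomes available `delay` cycles after issue (two for loads and branches, one otherwise), and it ignores values carried in from the previous iteration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPS 2   /* illustrative limit on operands produced in the loop */

struct instr {
    char label;            /* instruction label from figure 3.3 */
    int delay;             /* cycles until the result is available */
    int ndeps;             /* number of entries used in deps[] */
    int deps[MAX_DEPS];    /* positions (in issue order) of the producers */
};

/* The loop body in the issue order of figure 3.5. */
const struct instr loop_body[12] = {
    {'a', 2, 0, {0}},      /* a <- arc(c)          */
    {'b', 2, 1, {0}},      /* b <- 1(a)            */
    {'c', 2, 1, {1}},      /* c <- node(b)         */
    {'d', 2, 1, {2}},      /* d: if (c = 0) exit   */
    {'e', 2, 1, {2}},      /* e <- data(c)         */
    {'f', 1, 1, {4}},      /* f <- e + i           */
    {'g', 2, 1, {5}},      /* g <- T(f)            */
    {'l', 2, 0, {0}},      /* l: goto loop         */
    {'h', 1, 1, {6}},      /* h <- g << 2          */
    {'j', 2, 1, {8}},      /* j <- R.value(h)      */
    {'i', 2, 1, {8}},      /* i <- R.index(h)      */
    {'k', 1, 1, {9}},      /* k <- k + j           */
};

/* Issues the instructions strictly in order, one per cycle at most, and
 * stalls until every operand is available. Fills issue[] and returns the
 * cycle at which the last result becomes available. */
int issue_in_order(const struct instr *code, size_t n, int *issue)
{
    int next_slot = 0;     /* first cycle in which the next issue may occur */
    int done = 0;
    size_t k;
    int d;

    for (k = 0; k < n; k++) {
        int t = next_slot;
        for (d = 0; d < code[k].ndeps; d++) {
            int p = code[k].deps[d];
            assert(p >= 0 && (size_t)p < k);
            if (issue[p] + code[p].delay > t)
                t = issue[p] + code[p].delay;
        }
        issue[k] = t;
        next_slot = t + 1;
        if (t + code[k].delay > done)
            done = t + code[k].delay;
    }
    return done;
}
```

Under these assumptions the simulation issues d at time 6, l at time 11, h at time 12, and k at time 15, and the last result is available at time 16, in agreement with the 16 cycles per iteration stated above.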

3.2.2.  Vector  Architectures 

Higher levels of performance can be achieved by increasing the level of concurrency beyond
that offered by a scalar processor. Suppose that the RISC-1 microprocessor is augmented
with vector capabilities similar to those of the Cray-1[1]. In order to use vector
instructions, a loop must be distributed into a set of simpler loops such that each new
loop corresponds to exactly one vector instruction[16]. This transformation is shown in
figure 3.6. In this example we optimistically assume that some unspecified hardware is
available for loop control, hence we ignore the if and goto instructions. Only loops 2,
4, and possibly 5 can be vectorized since the other loops each contain more than one
instruction. The reason loops 1 and 3 cannot be distributed
further  is  that  the  statements  within  them  form  recurrence^  7].  It  is  well  known  that 
recurrences  involving  pointers  through  memory  or  indirect  references  through  arrays 
such  as  those  in  loops  1  and  3  cannot  be  vectorized  and  must  be  executed  serially.  Loop 
5  is  also  a  recurrence,  but  it  is  a  special  case  in  that  it  is  a  vector  reduction  operation 


    loop 1:     a_m ← arc(c_{m-1})         Load pointer to arc pointer list.
                b_m ← 1(a_m)               Load pointer to second arc descriptor.
                c_m ← node(b_m)            Load pointer to destination of arc.

    ignored:    d: if (c_m = 0) exit       Terminate loop if no more vertices.

    loop 2:     e_m ← data(c_m)            Load data field from vertex descriptor.

    loop 3:     f_m ← e_m + i_{m-1}        Use data to offset recirculating index.
                g_m ← T(f_m)               Load index to array of records.
                h_m ← g_m << 2             Shift index to form offset.
                i_m ← R.index(h_m)         Load new value of recirculating index.

    loop 4:     j_m ← R.value(h_m)         Load new data value of interest.

    loop 5:     k_m ← k_{m-1} + j_m        Accumulate new data value.

    ignored:    l: goto loop               Start another iteration.

Figure 3.6. Example of vectorization by loop distribution.

involving an associative operator. On some vector machines special hardware is available
to evaluate such recurrences quickly with a single vector instruction[28].

From figure 3.6 we can derive the best case execution timing for a vector processor with
memory access delay of two cycles and arithmetic computation delay of one cycle. Namely,
we assume that the vector processor has a sufficient number of processing elements or
pipelines so that resource contention is not an issue. However, we do assume that there
is only a single scalar processor with an instruction issue rate of one instruction per
clock cycle, since that is usually the case for vector architectures. Loops 1 and 3 must
be executed on the scalar processor. Loop 1 contains a chain of three loads so its
execution time is 6n cycles. Loop 3 contains a chain of two loads and two ALU operations
so its execution time is also 6n cycles. Loops 2 and 4 each contain one load instruction
and so can be vectorized. Since all n iterations of a vectorizable loop can, with
sufficient resources, be performed in parallel, loops 2 and 4 take 2 cycles each. We
shall be very optimistic and assume that loop 5 can be treated as if it were a pure
vector instruction and charge only 1 clock for its execution. Taken together we find that
the best execution time on a vector processor is

$$ T_{vector} = 6n + 2 + 6n + 2 + 1 = 12n + 5 $$

The speedup relative to the scalar processor is given by

$$ SR_{vector} = \frac{T_{scalar}}{T_{vector}} = \frac{16n}{12n + 5} $$

For a large number of iterations the speedup approaches

$$ SR_{vector} \approx 1.333 $$

Note  that  this  speedup  is  very  optimistic  and  does  not  take  into  account  overhead  for 
loop  control. 
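The timing model above is simple enough to evaluate directly. The following C sketch restates it; the function names `t_scalar`, `t_vector`, and `vector_speedup` are ours, and the constants are those of the best-case analysis in the text (16 cycles per scalar iteration; 6n cycles each for loops 1 and 3, 2 cycles each for loops 2 and 4, and 1 cycle for loop 5).

```c
#include <assert.h>

/* Scalar RISC-1 time for n iterations: 16 cycles per iteration. */
long t_scalar(long n)
{
    return 16 * n;
}

/* Best-case vector time for n iterations, loop by loop:
 * loop 1 (6n) + loop 2 (2) + loop 3 (6n) + loop 4 (2) + loop 5 (1). */
long t_vector(long n)
{
    return 6 * n + 2 + 6 * n + 2 + 1;
}

/* Speedup of the vector processor relative to the scalar processor. */
double vector_speedup(long n)
{
    assert(n > 0);
    return (double)t_scalar(n) / (double)t_vector(n);
}
```

For a single iteration the vector time is 17 cycles, so the vector processor is actually slower than the scalar one. As n grows the ratio rises toward 16/12, which is the 1.333 limit quoted above.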

In spite of the fact that vector architectures offer much more concurrency, the achieved
performance is rather poor for this example. The reason is that only a small fraction of
the loop is vectorizable due to the extensive number of recurrences and hence most of the
execution must be performed in scalar mode. Almost all linked data structures cause
recurrences through memory. The use of linked data structures is pervasive in nonnumeric
programs as well as certain numerical programs, such as those that operate on dynamic
sparse matrices. In view of the high cost of providing vector execution capabilities, the
use of a vector architecture is inappropriate for job loads containing significant usage
of linked data structures.


3.2.3. Multiprocessor Architectures


A multiprocessor is more flexible than a vector architecture. A schedule showing
issue times for the first three iterations on a multiprocessor consisting of three
independent scalar RISC-1 microprocessors is shown in figure 3.7. In this figure the
instructions are represented by either the result register name or the explicit label as
shown in figure 3.3. Because recurrences transmit data to future iterations of the loop,
the initiation time of each loop iteration is delayed relative to the previous iteration.
Techniques for calculating the value of this delay and transformation algorithms for
minimizing this delay have been developed[29]. The schedule shown is the best possible
in terms of minimizing the inter-iteration delay, which is six in this example. Provided
there are enough processors to eliminate resource conflicts, the execution time to complete
n iterations on a multiprocessor is

$$T_{\mathrm{MULTI}} = 6n + 9$$

[Figure 3.7: table of instruction issue times, with columns TIME and PROCESSOR 1, 2, 3, for times 0 through 27.]

Figure 3.7. Multiprocessor schedule for three iterations of the loop.


Note  that  instruction  k  takes  only  one  time  unit  to  complete  execution  and  that  all 
instructions  are  actually  completed  by  the  time  k  is  completed.  The  speedup  relative  to 
a  single  scalar  processor  is  therefore 

$$SR_{\mathrm{MULTI}} = \frac{T_{\mathrm{SCALAR}}}{T_{\mathrm{MULTI}}} = \frac{16n}{6n + 9} \approx 2.667$$


Since  multiprocessors  are  more  flexible  than  vector  processors,  they  can  be 
expected  to  achieve  higher  speedup  for  a  wider  class  of  application  programs.  In  this 
example  a  multiprocessor  architecture  was  able  to  do  twice  as  well  as  a  vector  architec¬ 
ture.  However  in  calculating  the  timing  for  a  multiprocessor  we  have  ignored  the  com¬ 
munication  and  synchronization  time  between  processors  and  the  time  lost  due  to 
memory  port  contention.  If  a  multiprocessor  has  independent  scalar  processing  units 
that  operate  asynchronously  from  each  other  (an  asynchronous  multiprocessor),  then 
some  amount  of  interprocessor  communication  overhead  is  inevitable.  When  asynchro¬ 
nous  multiprocessors  are  used  to  evaluate  recurrences,  the  achieved  performance  is  very 
sensitive  to  interprocessor  communication  overhead.  For  example,  even  a  modest  inter¬ 
processor  communication  delay  of  two  clock  cycles  per  iteration  would  lower  the 
speedup  from  2.667  to  2. 
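
The sensitivity to communication overhead can be illustrated with a small Python sketch. It assumes, as the example above implies, that a per-iteration communication delay simply adds to the inter-iteration delay of six; the function names and the `comm_delay` parameter are hypothetical.

```python
def speedup_multi(n: int, comm_delay: int = 0) -> float:
    """Speedup of the multiprocessor over the 16n-cycle scalar processor.

    comm_delay is an assumed communication cost in clocks per iteration,
    added to the inter-iteration delay of six.
    """
    t_multi = (6 + comm_delay) * n + 9
    return (16 * n) / t_multi


def asymptotic_speedup(comm_delay: int) -> float:
    """Limit of speedup_multi as n grows without bound."""
    return 16 / (6 + comm_delay)


for delay in (0, 1, 2, 4):
    print(delay, round(asymptotic_speedup(delay), 3))
```

With no delay the limit is 16/6, and with a delay of two clocks it is 16/8, matching the figures quoted in the text.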


The  asynchronous  characteristic  of  multiprocessors  is  advantageous  in  that  it  pro¬ 
vides  greater  application  flexibility  and  also  allows  a  larger  collection  of  processors  to 
be  coupled  together,  due  to  less  restrictive  clocking  requirements.  When  applied  to 
recurrences,  however,  these  advantages  cannot  easily  be  realized  because  recurrences 
severely  restrict  the  number  of  processors  that  can  profitably  be  used.  Referring  to 
figure  3.7,  we  see  that  the  maximum  number  of  processors  that  can  be  used  efficiently 
on this example is three, since the fourth iteration cannot begin until time 18, whereas
processor 1 is free by time 16. Thus asynchronously coupling large numbers of processors
not only incurs the additional cost of asynchronous interprocessor communication,
but provides no performance gain in many applications. If only a few processors can be
used effectively, a small synchronous system is more cost-effective.

3.2.4. Horizontal Architectures

An  alternative  to  the  asynchronous  multiprocessor  architecture  is  the  horizontal 
architecture^.  30. 31].  A  horizontal  architecture  consists  of  multiple  processing  ele¬ 
ments  controlled  by  a  single  instruction  issue  unit.  These  architectures  use  wide 
instructions  with  multiple  fields  to  control  each  processing  element  independently  in  a 
manner  similar  to  horizontal  microcode.  The  processing  elements  may  be  specialized 
pipelined functional units[30, 31] or relatively unspecialized scalar processors[9]. The
main  advantage  of  horizontal  architectures  is  that  they  are  globally  synchronized  and 
extremely  tightly  coupled.  These  characteristics  allow  horizontal  architectures  to  pro¬ 
vide  low-overhead  high-bandwidth  communication  between  processing  elements  at  rela¬ 
tively  low  cost.  The  disadvantage  is  that  relatively  fewer  processing  elements  can  be 
supported  by  such  architectures.  This,  though,  is  not  a  serious  handicap  for  recurrence 
intensive  workloads  which  cannot  effectively  utilize  many  processors  in  any  case. 


Consider  a  horizontal  architecture  that  consists  of  a  two-segment  pipelined 
memory  reference  unit,  a  one-segment  ALU,  and  a  two-segment  pipelined  delayed  jump 
unit.  This  configuration  was  chosen  to  be  compatible  with  the  RISC-1  microprocessor 
being  used  as  the  basis  for  comparison.  The  separation  of  functions  is  motivated  by  the 
observation  that  distinct  specialized  hardware  is  needed  for  each  of  the  pipelines,  hence 
parallel  execution  of  these  functions  is  reasonable. 

Figure 3.8 shows the execution schedule for one iteration of the loop on this processor.
In this figure φ cycles are shown as blank entries. Note that the time for one
iteration  is  15  cycles  instead  of  16  cycles  for  the  RISC-1  microprocessor.  This  reduction 
arises  because  the  increased  concurrency  allows  instructions  d  and  e  to  be  issued  in  the 
same clock cycle. However, if only this small speedup of 1.067 were attained it would
hardly be worthwhile in view of the higher cost of a horizontal architecture.

Figure 3.8. Horizontal architecture schedule for one iteration of the loop.


High  performance  is  achieved  on  a  horizontal  architecture  when  the  execution  of 
multiple  loop  iterations  is  overlapped.  The  left  hand  side  of  figure  3.9  shows  one  itera¬ 
tion  of  the  schedule  for  the  loop.  Several  extra  delays  have  been  added  to  the  compact 
schedule  shown  in  figure  3.8  for  reasons  that  will  become  apparent.  Note  that  only  the 
first segment of the MEM and JMP pipelines has been shown since the behavior of the
second  segment  can  be  inferred  from  the  first  segment. 

The  right  hand  side  of  figure  3.9  shows  a  composite  schedule  obtained  by  overlap¬ 
ping  four  copies  of  the  left  hand  side  schedule  each  delayed  by  seven  clocks.  This  time 
delay  between  iterations  is  called  the  initiation  interval.  The  superscript  indicates  the 
iteration  number  associated  with  each  particular  instruction.  An  initiation  interval  of 
seven  clocks  is  sufficient  to  satisfy  dependencies  between  iterations.  The  inter-iteration 
dependencies  are  cm  /  m+1,  and  km 

As  shown  in  the  figure,  beginning  with  clock  14  the  memory  pipeline  is  fully  util¬ 
ized  and  continues  to  be  fully  utilized  throughout  future  iterations.  Therefore  an  ini¬ 
tiation  interval  of  7  is  minimum.  The  execution  time  on  the  horizontal  architecture  is 


$$T_{\mathrm{HORIZONTAL}} = 7n + 16$$

Hence the speedup relative to a scalar processor is

$$SR_{\mathrm{HORIZONTAL}} = \frac{T_{\mathrm{SCALAR}}}{T_{\mathrm{HORIZONTAL}}} = \frac{16n}{7n + 16} \approx 2.286$$


While this speedup is not as high as the asynchronous multiprocessor speedup of 2.667, it
should be noted that communication between processing elements in a horizontal architecture
does not introduce any overhead since interprocessor synchronization is precalculated
at compile time. The multiprocessor required three memory ports and three
ALUs  to  achieve  a  speedup  of  2.667  while  the  horizontal  architecture  required  only  one 


memory  port  and  one  ALU  to  achieve  almost  as  high  a  speedup.  Therefore  the  horizon¬ 
tal  architecture  can  be  expected  to  be  less  costly  than  the  multiprocessor.  Furthermore, 
considering  that  the  asynchronous  multiprocessor  speedup  drops  to  2.000  with  the 
introduction  of  even  very  modest  communication  delay,  the  horizontal  architecture  can 
be  expected  to  outperform  the  asynchronous  multiprocessor  for  many  job  loads. 
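
A rough Python sketch of this comparison follows. It reuses the execution-time formulas from the text and assumes, hypothetically, that multiprocessor communication overhead adds a fixed number of clocks `d` to each inter-iteration delay. Exact fractions are used so the comparison is not affected by rounding.

```python
from fractions import Fraction


def horizontal_speedup(n: int) -> Fraction:
    """Speedup of the horizontal architecture: 16n / (7n + 16)."""
    return Fraction(16 * n, 7 * n + 16)


def multi_speedup(n: int, d: int) -> Fraction:
    """Speedup of the multiprocessor with an assumed delay d per iteration."""
    return Fraction(16 * n, (6 + d) * n + 9)


n = 1000
for d in (0, 1, 2):
    winner = "horizontal" if horizontal_speedup(n) > multi_speedup(n, d) else "multi"
    print(d, float(multi_speedup(n, d)), float(horizontal_speedup(n)), winner)
```

Under this assumption the two asymptotic speedups coincide at a delay of one clock (both tend to 16/7), and the horizontal architecture is ahead for any larger delay.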


3.2.5. Summary

High performance computer systems need to be efficient on vectorizable
numerical job loads as well as on other more general job loads. The extensive use of
linked  data  structures  causes  many  recurrences  through  memory.  These  recurrences 
cannot  be  transformed  into  vector  form,  hence  vector  architectures  cannot  provide  sub¬ 
stantial  speedup  on  such  job  loads.  Recurrences  also  limit  the  number  of  processors 
that  can  be  used  profitably  in  a  multiprocessor  architecture.  This,  coupled  with  the  fact 
that  asynchronous  multiprocessors  generally  have  nonnegligible  interprocessor  com¬ 
munication  overhead,  suggests  that  asynchronous  multiprocessors  consisting  of  a 
number  of  conventional  scalar  processors  may  not  be  the  most  cost-effective  architec¬ 
ture  for  general  job  loads. 

The  use  of  horizontal  architectures  appears  to  offer  improved  performance  over 
vector  architectures,  and  improved  cost-effectiveness  over  asynchronous  multiproces¬ 
sors.  High  utilization  of  horizontal  multiprocessors  can  be  obtained  by  code  scheduling 
so  as  to  maximize  overlap  between  loop  iterations.  In  the  next  sections  we  develop  a 
technique  for  the  automatic  generation  of  optimal  throughput  loop  schedules  for  hor¬ 
izontal  architectures. 


3.3. Scheduling Graphs with Acyclic Dependencies

The complexity of finding an optimal throughput schedule for a loop is highly
dependent on the characteristics of the dependency graph representing the loop. For purposes
of code scheduling it is convenient to classify the nodes in a dependency graph, N,
into subsets

$$N = \bigcup_{k=1}^{m} R_k \cup S \cup A$$

This  classification  is  based  on  strongly  connected  component^. 32].  Strongly  connected 
components  of  a  directed  graph  are  defined  to  be  maximal  sets  of  nodes  such  that  if 
nodes  x  and  y  are  members  of  the  same  strongly  connected  component,  then  there  is  a 
directed path from x to y and also a directed path from y to x. Each strongly connected
subgraph containing two or more nodes is called a multinode recurrence, denoted
by the set $R_k$. The total number of such multinode recurrences is $m$. Other strongly
connected subgraphs contain only one node each and are called self-loops, denoted by
the set S. The remaining nodes are not in any strongly connected components. These
nodes are classified as acyclic, denoted by set A. Note that the node sets
$R_1, R_2, \ldots, R_m$, S, and A are all disjoint.
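
The classification can be sketched in Python using Tarjan's strongly connected components algorithm. This is one possible way to compute the partition, not necessarily the method used in this work; the function name and the small example graph are hypothetical.

```python
from collections import defaultdict


def classify_nodes(nodes, arcs):
    """Split nodes into multinode recurrences, self-loops, and acyclic nodes.

    nodes: iterable of hashable node names.
    arcs:  iterable of (src, dst) dependency arcs, back arcs included.
    Returns (recurrences, self_loops, acyclic): a list of sets and two sets.
    """
    succ = defaultdict(list)
    for u, v in arcs:
        succ[u].append(v)

    index, low, on_stack, stack = {}, {}, set(), []
    components = []

    def visit(u):  # Tarjan's depth-first search
        index[u] = low[u] = len(index)
        stack.append(u)
        on_stack.add(u)
        for v in succ[u]:
            if v not in index:
                visit(v)
                low[u] = min(low[u], low[v])
            elif v in on_stack:
                low[u] = min(low[u], index[v])
        if low[u] == index[u]:  # u is the root of a component
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == u:
                    break
            components.append(comp)

    for u in nodes:
        if u not in index:
            visit(u)

    recurrences = [c for c in components if len(c) >= 2]
    singles = [next(iter(c)) for c in components if len(c) == 1]
    self_loops = {u for u in singles if u in succ[u]}
    acyclic = {u for u in singles if u not in succ[u]}
    return recurrences, self_loops, acyclic


# Hypothetical graph: a->b->c->a is a recurrence, k->k is a self-loop.
arcs = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"),
        ("d", "e"), ("e", "k"), ("k", "k")]
recurrences, self_loops, acyclic = classify_nodes("abcdek", arcs)
print(recurrences, self_loops, acyclic)
```

A single-node component counts as a self-loop only if the node has an arc to itself; otherwise it is acyclic.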

The problem of optimal throughput loop scheduling is to find a valid schedule
with a minimum initiation interval (MII). The simplest case occurs when the dependency
graph is acyclic, i.e., the graph contains only acyclic nodes. An optimal scheduling
technique for acyclic dependency graphs has been proposed by Rau[31] for use on
horizontal architectures. In this section we briefly review Rau's results.

3.3.1.  The  Available  Resource  Limit  Constraint 


If a graph contains only acyclic nodes then every iteration of the loop is independent
of every other iteration. Consider the acyclic dependency graph with nodes N = A
shown in figure 3.10. This graph was derived from figure 3.4 by ignoring the feedback
arcs, thus removing inter-iteration dependencies. The letters M, A, and J within each
node specify the pipelined functional unit (MEM, ALU, and JMP, respectively)
required by that node. The number in each node, of the form +s, gives the number of
segments in the functional unit pipeline.

Since there are seven memory references needed per iteration, and only one pipelined
memory unit is available, successive iterations cannot be initiated less than seven
clocks apart. In general there is a lower bound on the MII based on resource constraints.
This lower bound is called the available resource limit (ARL) and is defined as
follows.

$$\mathrm{ARL}(N) = \max_{c \in C} \sum_{i=1}^{n} \delta_i(c)$$

where

$$\delta_i(c) = \begin{cases} 1 & \text{if } f_i = c \\ 0 & \text{otherwise} \end{cases}$$

Nodes are numbered from 1 to n, indexed by i. The type of functional unit required
by a node is given by $f_i$. C denotes the set of all functional unit types. In this example,
C = {M, A, J}.

The available resource limit states that the initiation interval is lower bounded by
the most heavily used resource. Since, for this example, $\sum_{i=1}^{n} \delta_i(M) = 7$, $\sum_{i=1}^{n} \delta_i(A) = 3$,
and $\sum_{i=1}^{n} \delta_i(J) = 2$, the ARL is 7. This lower bound is not restricted to schedules for acyclic
dependency graphs. In general, a valid schedule for any dependency graph must
satisfy the condition

$$\mathrm{MII}(N) \geq \mathrm{ARL}(N)$$


where  N  is  the  set  of  all  nodes  in  the  graph. 
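
As a minimal sketch, the available resource limit can be computed by counting nodes per functional unit type. The function name and the node numbering below are hypothetical; only the per-type totals (seven M, three A, two J) come from the example, and one unit per type is assumed.

```python
from collections import Counter


def available_resource_limit(unit_of):
    """Return the ARL: the largest number of nodes sharing one unit type.

    unit_of maps each node to its functional unit type (f_i in the text).
    One pipelined unit per type is assumed, as in the example machine.
    """
    usage = Counter(unit_of.values())  # per-type node counts
    return max(usage.values())


# Hypothetical node numbering; only the totals per type are from the text.
unit_of = {i: "M" for i in range(1, 8)}          # seven memory references
unit_of.update({i: "A" for i in range(8, 11)})   # three ALU operations
unit_of.update({i: "J" for i in range(11, 13)})  # two jump-unit operations

print(available_resource_limit(unit_of))
```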

Patel[33] has shown that for strictly acyclic dependency graphs, at least one functional
unit can always be fully saturated. Therefore

$$\mathrm{MII}(A) = \mathrm{ARL}(A)$$

Once the MII has been found, it is relatively easy to construct a valid schedule with
that initiation interval.


Before presenting the algorithm for schedule construction, some terminology is
necessary. A schedule for a set of nodes N = {1, ..., n} is given by the issue times of
each node, denoted by $x_i$, $1 \le i \le n$. The initiation interval for a schedule is denoted by
$p$. The execution delay of a node is equal to the number of segments in the pipelined
functional unit used by that node. This delay is given by $s_{f_i}$, where $f_i$ is the functional
unit used by node $i$.

A resource conflict occurs if two nodes require the same functional unit in the
same clock cycle. Note that since successive iterations of a loop are overlapped with an
initiation interval of $p$, a node from iteration $k$ issuing at time $t$ will conflict with a
node from iteration $k-1$ issuing at time $t+p$, provided both nodes require the same
functional unit. More generally, the modulo usage function is defined as


$$\delta_i(c, t) = \begin{cases} 1 & \text{if } f_i = c \text{ and } (x_i \bmod p) = t \\ 0 & \text{otherwise} \end{cases}$$


This function is 1 if and only if a node $i$ requires resource $c$ at any of the times
$t, t+p, t+2p, \ldots$ For a schedule to be valid (i.e., to cause no resource conflicts), the following
condition must hold.

$$\sum_{i=1}^{n} \delta_i(c, t) \le 1 \quad \text{for all } c \in C,\ 0 \le t < p$$
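
The validity condition can be checked directly by counting, for each functional unit, how many nodes fall on each time slot modulo $p$. The following Python sketch assumes one unit per type; the function name and the two-node examples are hypothetical.

```python
from collections import Counter


def is_conflict_free(issue_time, unit_of, p):
    """True if no functional unit is claimed twice in the same slot modulo p.

    issue_time maps node -> x_i, unit_of maps node -> f_i.
    """
    usage = Counter((unit_of[i], issue_time[i] % p) for i in issue_time)
    return all(count <= 1 for count in usage.values())


unit_of = {"u": "M", "v": "M"}
# With p = 3, times 0 and 3 both land in slot 0 of the memory unit.
print(is_conflict_free({"u": 0, "v": 3}, unit_of, p=3))
# Times 0 and 4 land in slots 0 and 1, so they do not collide.
print(is_conflict_free({"u": 0, "v": 4}, unit_of, p=3))
```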

An  algorithm  for  finding  an  optimal  schedule  for  an  acyclic  graph  is  given  in 
figure  3.11.  Note  that  nodes  in  a  directed  acyclic  graph  can  always  be  topologically 
sorted  to  get  a  linear  sequence  in  which  every  node  depends  only  on  nodes  preceding  it 
in  the  sequence.  Without  loss  of  generality  we  can  assume  that  the  set  of  nodes  N  is  so 
ordered.  An  informal  description  of  algorithm  A  follows: 

(1) The first node is not dependent on any other node, hence step A4 does nothing.
Step A5 also does nothing since no resource has been reserved for any node.
Therefore $x_1 = 0$.

(2) Suppose that a partial schedule consisting of nodes 1 through $j-1$ has already
been found. The earliest issue time for node $j$ can be computed by examining all
the predecessor nodes that $j$ depends on, given by the set pred($j$). This examination
is carried out in step A4. Since the nodes are being processed in topological
order, all the predecessors of $j$ must have already been processed and hence their
$x_i$ are well defined.


A1.  $p \leftarrow \mathrm{ARL}(N)$
A2.  for $j \leftarrow 1$ to $n$ do
A3.      $x_j \leftarrow 0$
A4.      for each $i \in pred(j)$ do $x_j \leftarrow \max\{x_j,\ x_i + s_{f_i}\}$
A5.      while $\sum_{i=1}^{j-1} \delta_i(f_j,\ x_j \bmod p) \neq 0$ do $x_j \leftarrow x_j + 1$



Figure  3.11.  Algorithm  A:  optimal  throughput  schedule  for  acyclic  graphs. 


(3) If issuing node $j$ at the earliest permissible time does not cause a resource conflict
then node $j$ is assigned to that time. Otherwise the starting time of node $j$ is
incremented until there is no conflict. This resolution of resource conflicts is carried
out in step A5. Note that because $p$ has been calculated to be just large
enough so as not to overutilize any resource, the while-loop in step A5 terminates
after at most $p-1$ iterations.

The  application  of  algorithm  A  to  the  graph  of  figure  3.10  yields  the  schedule  shown  in 
figure  3.12. 
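
The steps of algorithm A can be sketched in Python as follows. This is an illustrative rendering of steps A1 through A5 as described above, assuming one unit per functional unit type. The function name, argument layout, and four-node example graph are hypothetical and are not the graph of figure 3.10.

```python
from collections import Counter


def schedule_algorithm_a(nodes, pred, unit_of, segments):
    """Return (p, x) where x maps each node to its issue time.

    nodes:    node names in topological order.
    pred:     node -> list of predecessor nodes.
    unit_of:  node -> functional unit type.
    segments: unit type -> number of pipeline segments (execution delay).
    """
    usage = Counter(unit_of[j] for j in nodes)
    p = max(usage.values())                     # A1: p <- ARL(N)
    reserved = set()                            # occupied (unit, slot) pairs
    x = {}
    for j in nodes:                             # A2
        t = 0                                   # A3
        for i in pred.get(j, ()):               # A4: wait for predecessors
            t = max(t, x[i] + segments[unit_of[i]])
        while (unit_of[j], t % p) in reserved:  # A5: skip occupied slots
            t += 1
        reserved.add((unit_of[j], t % p))
        x[j] = t
    return p, x


# Hypothetical example: three memory nodes and one ALU node.
nodes = ["a", "b", "c", "d"]
pred = {"b": ["a"], "c": ["b"], "d": ["a"]}
unit_of = {"a": "M", "b": "M", "c": "A", "d": "M"}
segments = {"M": 2, "A": 1}

p, x = schedule_algorithm_a(nodes, pred, unit_of, segments)
print(p, x)
```

In this example node d is ready at time 2 but finds memory slots 2 and 0 taken, so it is pushed to time 4.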

Because every iteration of the loop is identically scheduled, a complete characterization
of the steady state behavior of the loop can be obtained by looking at $p$ consecutive
clock cycles. One can simply divide a schedule into sections each $p$ clock cycles
long, label successive sections with successively decremented iteration superscripts, and
overlay them on top of one another. This overlaid representation is called a modulo
reservation table (MRT). Figure 3.13 shows the MRT which corresponds to the example
schedule. The superscript gives the iteration numbers. The subscript gives the issue
time of the node in clock cycles relative to the beginning of the schedule. Note that the
issue times are redundant; a node $i$ with superscript $-k$ is issued at time $x_i = t + kp$, where $t$ is the slot
time in the MRT. For example, a node with superscript $-2$ and a slot time of 5 has an issue time of
$5 + 2 \cdot 7 = 19$. Although redundant in this example, the issue time is included to be
compatible with later examples.
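
A modulo reservation table can be built from a schedule by splitting each issue time into a slot time and an iteration superscript. The Python sketch below is illustrative; the function name, the dictionary layout, and the node name `v` are hypothetical.

```python
def modulo_reservation_table(issue_time, unit_of, p):
    """Return {(unit, slot): (node, superscript)} for a schedule.

    An issue time x is split as x = slot + k * p; the superscript is -k.
    """
    table = {}
    for node, x in issue_time.items():
        k, slot = divmod(x, p)
        table[(unit_of[node], slot)] = (node, -k)
    return table


# A node issued at time 19 with p = 7 sits in slot 5 with superscript -2.
table = modulo_reservation_table({"v": 19}, {"v": "M"}, p=7)
print(table)
```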

In addition to being a more compact representation of the schedule, the MRT is
also a convenient data structure for the evaluation of the while condition in step A5 of
algorithm A. The while-loop terminates when a time slot is found whose use by the
current node will not cause resource conflicts. By using the MRT, the while-loop terminates
when there is an empty slot in the appropriate functional unit column. A


Figure  3.13.  Modulo  reservation  table  for  acyclic  graph  schedule. 


nonexistent previous iterations are unused. The number of suboptimal iterations is
related to the length of the schedule, $l$, defined as

$$l = \left\lfloor \frac{\max\{x_i\}}{p} \right\rfloor + 1$$

Referring to figure 3.13, a schedule with length $l = 4$ has nodes belonging to iterations
in the range 0 through $-(l-1) = -3$. During iteration 1, the time slots reserved for
nodes labeled as iterations -1, -2, and -3 are unused. In general, during iteration
$k$, $1 \le k < l$, the time slots reserved for nodes belonging to iterations labeled
$-k, \ldots, -(l-1)$ are unused. Therefore a schedule of length $l$ achieves optimal performance
beginning with iteration $l$.
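
The startup behavior can be sketched in Python as below. The sketch assumes the schedule length is the floor of the largest issue time divided by $p$, plus one, which is how the definition above is read here. The function names and the sample issue times are hypothetical.

```python
def schedule_length(issue_time, p):
    """Number of p-cycle sections spanned by the schedule."""
    return max(issue_time.values()) // p + 1


def unused_labels(k, length):
    """Iteration labels whose reserved slots are idle during iteration k."""
    return list(range(-k, -length, -1))


# Hypothetical issue times whose maximum lies in the fourth section (p = 7).
length = schedule_length({"first": 0, "last": 26}, p=7)
print(length)
for k in range(1, length + 1):
    print(k, unused_labels(k, length))
```

For a length of four, iteration 1 leaves the slots labeled -1, -2, and -3 idle, and from iteration 4 onward no slots are idle.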

In algorithm A the assignment of $x_j$ was made based on the first available slot in
the  MRT.  There  is  no  reason  why  the  first  empty  slot  must  be  assigned.  A  shorter 


then the complexity reduces to $O(np)$. Note that for acyclic graphs, $p \le n$ since full
utilization of at least one resource is achieved.

Algorithm A has a relatively low complexity of $O(np)$ because it generates an
optimal throughput schedule but does not guarantee minimum startup penalty. On the
other hand, to find an optimal throughput schedule with minimum startup penalty, it is
necessary to consider all feasible assignments of $x_j$ at step A5 instead of the first feasible
assignment. Since there are $O(p)$ feasible assignments per node, the total complexity
of such an algorithm is $O(p^n)$. Patel[33] has demonstrated an efficient branch-and-bound
algorithm for finding optimal throughput schedules with minimum startup
penalty. In view of the fact that loops usually continue for many iterations and therefore
the impact of the startup penalty is amortized over a long period of time, it may
not be worthwhile to pay the additional computational cost necessary to generate a
minimum length schedule. However, the need for minimum length schedules reappears
when more general graphs are considered, to be discussed in section 3.5.

The  observation  that  there  are  multiple  feasible  assignments  of  the  Xj  leads  to  the 
minimum complexity optimal throughput scheduling algorithm by Rau[31], shown in
figure  3.15.  Algorithm  B  has  a  lower  complexity  than  algorithm  A  by  making  use  of 
the  fact  that  nodes  can  be  placed  anywhere  into  the  MRT  so  long  as  they  are  in  the 
proper  functional  unit  class  column.  A  description  of  algorithm  B  follows: 

(1) Steps B4-B5 find the earliest issue time for each node. These steps are the same as
steps A3-A4 in algorithm A.

(2) Step B6 performs resource assignment. Each node is assigned a slot time, $r_{f_j}$, in
the MRT. The counters, $r_c$, keep track of the next empty slot time for each
resource class, $c \in C$. Note that the available resource limit assures that $r_c < p$
for every resource class $c$.


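B1.  $p \leftarrow \mathrm{ARL}(N)$
B2.  for each $c \in C$ do $r_c \leftarrow -1$
B3.  for $j \leftarrow 1$ to $n$ do
B4.      $x_j \leftarrow 0$
B5.      for each $i \in pred(j)$ do $x_j \leftarrow \max\{x_j,\ x_i + s_{f_i}\}$
B6.      $r_{f_j} \leftarrow r_{f_j} + 1$
B7.      increase $x_j$, if necessary, until $x_j \bmod p = r_{f_j}$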


Figure  3.15.  Algorithm  B:  minimum  complexity  optimal  throughput  schedule. 


(3) Step B7 adjusts the issue time of each node such that it falls on the assigned slot
time, $r_{f_j}$. This adjustment is made by increasing $x_j$, if necessary, to satisfy the
equation $x_j \bmod p = r_{f_j}$. Since nodes are assigned in topological order, adding
extra delay to the issue time of a node can never violate data-dependency constraints
in an acyclic graph.

Application  of  algorithm  B  to  the  graph  in  figure  3.10  yields  the  schedule  shown  in 
figure  3.16.  Optimal  throughput  is  achieved  since  the  MEM  column  is  completely  filled. 
However  the  length  of  this  schedule  is  seven,  significantly  longer  than  the  schedule 
length of four produced by algorithm A. In general algorithm B can be expected to produce
longer schedules than algorithm A. This longer length has an adverse impact on
loop  startup,  but  may  be  negligible  if  the  loop  continues  for  many  iterations. 
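
A Python sketch of algorithm B, following the description of steps B1 through B7, is given below. The closed-form adjustment used for step B7 is an assumption made here to avoid a search loop; the function name and the example graph are hypothetical and are not the graph of figure 3.10.

```python
from collections import Counter


def schedule_algorithm_b(nodes, pred, unit_of, segments):
    """Return (p, x) where x maps each node to its issue time.

    Arguments have the same meaning as in the sketch of algorithm A:
    nodes in topological order, pred lists, unit types, segment counts.
    """
    usage = Counter(unit_of[j] for j in nodes)
    p = max(usage.values())                 # B1: p <- ARL(N)
    slot = {c: -1 for c in usage}           # B2: slot counters r_c
    x = {}
    for j in nodes:                         # B3
        t = 0                               # B4
        for i in pred.get(j, ()):           # B5: wait for predecessors
            t = max(t, x[i] + segments[unit_of[i]])
        c = unit_of[j]
        slot[c] += 1                        # B6: take the next free slot
        t += (slot[c] - t) % p              # B7: delay until t mod p == slot
        x[j] = t
    return p, x


# Hypothetical example: three memory nodes and one ALU node.
nodes = ["a", "b", "c", "d"]
pred = {"b": ["a"], "c": ["b"], "d": ["a"]}
unit_of = {"a": "M", "b": "M", "c": "A", "d": "M"}
segments = {"M": 2, "A": 1}

p, x = schedule_algorithm_b(nodes, pred, unit_of, segments)
print(p, x)
```

On this small graph the latest issue time is 6, compared with 4 when the first free slot is searched for as in algorithm A, which illustrates the longer schedules that slot-counter assignment tends to produce.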

The complexity of algorithm B can be calculated as follows: The loop in step B2
requires c iterations. The loop in step B3 requires n iterations. The inner loop in step
B5 requires n iterations in the worst case. Therefore the total complexity of the algorithm
is $O(c + n^2)$. Typically, the number of functional unit classes is fixed and the
node fan-in is bounded by a constant. Under this assumption c is a constant and the
nested loop in step B5 requires a total of $O(n)$ iterations. Thus the complexity of
algorithm B can be reduced to $O(n)$. This is clearly the minimal order of complexity
since each node must be examined at least once in order to generate code.

Figure 3.16. Optimal throughput schedule from algorithm B.

3.3.3. Summary

Loops  whose  dependency  graphs  are  acyclic  can  be  scheduled  to  achieve  optimal 
throughput  using  algorithm  B.  The  O  (n  )  complexity  of  algorithm  B  is  minimal.  This 
result forms the basis for the simple loop scheduling technique to be proposed.

Algorithm B generates schedules that are suboptimal in terms of length. To generate
schedules with minimum length requires an algorithm whose complexity is
$O(p^n)$. This high complexity makes the generation of such schedules unattractive for
acyclic graphs since the length of a schedule affects only the loop startup penalty and
not the loop steady-state performance. However, the generation of minimum length
schedules is necessary for the more general graphs in section 3.5, which involve multinode
recurrences.

3.4.  Scheduling  Graphs  with  Self-Loop  Dependencies 

In  the  previous  section  we  presented  scheduling  algorithms  for  strictly  acyclic 
dependency graphs with nodes N = A. In this section we present an extension for handling
dependency graphs that include self-looping nodes, but no multinode cycles.
These graphs have nodes $N = S \cup A$.

One common source of self-loops is loop induction variables[34]. An induction
variable x is a variable whose only assignments within the loop are of the form
x = x + c, where c is a constant or loop-invariant value. Optimizing compilers frequently
generate induction variables to step through arrays. For example, the Fortran
loop in figure 3.17(a) will be transformed into the loop in figure 3.17(b) by an optimizing
compiler in order to eliminate the multiplication otherwise needed to evaluate the
address of A(i,j). Other sources of self-loops include reduction operators such as vector
summation, and traversals of linked data structures. Statements such as p = p->next
in C or i = A(i) in Fortran are commonly used to traverse linked data structures.
Such statements cause self-loops in the dependency graph.
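
The induction variable transformation can be illustrated with a small Python sketch. It is not the Fortran code of figure 3.17; the function names, the base address, and the stride are hypothetical. The second function replaces the per-iteration multiplication with an update of the form x = x + c, which is the self-loop discussed above.

```python
def addresses_by_multiplication(base, stride, n):
    """Compute each element address with a multiplication per iteration."""
    return [base + i * stride for i in range(n)]


def addresses_by_induction(base, stride, n):
    """Compute the same addresses with an induction variable."""
    addresses = []
    addr = base
    for _ in range(n):
        addresses.append(addr)
        addr = addr + stride  # self-loop: depends on the previous iteration
    return addresses


print(addresses_by_multiplication(100, 8, 5))
print(addresses_by_induction(100, 8, 5))
```

Both functions produce the same sequence, but in the second each value of `addr` depends on the value computed in the previous iteration.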

Consider a self-loop node i of the form x = x + 1. The operation in this node is
dependent on the value generated by the same node in the previous iteration. Since the
execution time of node i is given by $s_{f_i}$, successive iterations of a loop must be
separated by at least $s_{f_i}$ clock cycles in order to allow enough time for the addition
function in the previous iteration to be completed. This constraint on the MII of a


It should be noted that many self-loops form linear recurrences[27]. Techniques
are available for transforming these recurrences into faster forms when the SLL constraint
prohibits maximum resource utilization. The SLL constraint can be relaxed
when  such  transforms  are  applicable  and  used.  These  transformation  techniques  are 
compatible  with  the  scheduling  techniques  proposed  in  this  chapter,  but  we  shall  not 
pursue  them  further  in  this  thesis. 

3.5.  Scheduling  Graphs  with  General  Dependencies 

A  general  dependency  graph  may  contain  acyclic  nodes  and  self-loop  nodes,  as 
well  as  one  or  more  multinode  recurrences.  Consider  a  loop  containing  a  multinode 
recurrence as shown in figure 3.18. Because the back arcs go from the very bottom of
the graph to the very top, there can be no overlap between successive iterations. Therefore
maximizing the throughput of such a loop is equivalent to minimizing the delay
through the acyclic subgraph which excludes the back arcs. The problem of scheduling
a set of dependent tasks on a machine with limited resources so that the total delay is
minimized is the same problem as the minimum startup penalty scheduling problem,
and is known to be NP-hard[29]. However, heuristics which work well in general are
available in the literature[29, 33]. It should be noted that although obtaining an
optimal schedule for multinode recurrences is NP-hard, it is quite practical to do so if
the number of nodes involved is small.

Having established that in general obtaining a maximum throughput schedule for
a loop containing a single multinode recurrence is NP-hard, one might consider finding a
way to decompose a loop containing multiple multinode recurrences into several smaller
NP-hard problems, one for each recurrence. Unfortunately this is not possible in general.
Consider the example graph in figure 3.4. This graph contains two multinode
recurrences {a, b, c} and {f, g, h, i}. Each of these can be individually scheduled
with an initiation interval of six cycles as shown in figure 3.19. Since the sum of the
resource requirements for both schedules is only five, it would seem possible to combine
the two schedules and retain the six cycle initiation interval. However, such a combination
cannot be achieved in this case because g and i must be three clocks apart and this
separation is incompatible with the two clock separation required between a, b, and c
as well as between c and the following a. Since both schedules are rigid in the sense
that no node of either schedule can be delayed without increasing the p of that
schedule, the two schedules cannot be combined to form a joint schedule with an initiation
interval of six.


[Figure 3.19: two schedule tables, each with columns TIME, MEM, and ALU.]


Figure  3.19.  Example  of  separately  scheduled  multinode  recurrences. 


Because multinode recurrences cannot in general be decomposed, it is necessary to use a combinatorial technique, such as branch-and-bound, to find the maximum throughput schedule for a set of multinode recurrences. It follows that for a set of multinode recurrences $R = \{R_1, R_2, \ldots, R_m\}$,

$$\mathrm{MII}(R) \ge \max\{\, \mathrm{MII}(R_1), \mathrm{MII}(R_2), \ldots, \mathrm{MII}(R_m) \,\}$$

For a loop with a general dependency graph whose node set is

$$N = \Bigl(\bigcup_{k=1}^{m} R_k\Bigr) \cup S \cup A$$

the optimal throughput schedule must also satisfy the available resource limit and the self-loop limit. Therefore, once $\mathrm{MII}(R)$ is found by some combinatorial technique, the minimum initiation interval for the entire loop must satisfy

$$\mathrm{MII}(N) \ge \max\{\, \mathrm{ARL}(N), \mathrm{SLL}(S), \mathrm{MII}(R) \,\}$$

At this point the following question arises: Is it possible to construct a schedule for $N$ such that the achieved initiation interval, $p$, is $\max\{\, \mathrm{ARL}(N), \mathrm{SLL}(S), \mathrm{MII}(R) \,\}$? An affirmative answer to this question would be significant in that once $\mathrm{MII}(R)$ is known, the optimal initiation interval $p$ is readily found. A schedule for that initiation interval must exist, and that schedule must yield the highest steady-state throughput (neglecting startup time).
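The lower bound above can be illustrated with a short sketch. It assumes that ARL is the largest per-class node count divided by the number of units of that class, and that SLL is the largest ratio of execution delay to dependency distance over the self-looping nodes; both limits are defined earlier in the chapter and are only paraphrased here. The function names, the dictionary layout, and the sample loop are hypothetical.

```python
from math import ceil, inf

def arl(f, u):
    """Available resource limit (assumed form): slots needed per class."""
    counts = {}
    for c in f.values():
        counts[c] = counts.get(c, 0) + 1
    return max(ceil(k / u[c]) for c, k in counts.items())

def sll(f, s, d):
    """Self-loop limit (assumed form): delay over distance for self-loops."""
    bound = 0
    for i, c in f.items():
        dist = d.get((i, i), inf)      # missing entry means no self-loop
        if dist != inf:
            bound = max(bound, ceil(s[c] / dist))
    return bound

def lower_bound(f, s, u, d, mii_r):
    """max{ARL(N), SLL(S), MII(R)}; mii_r comes from the combinatorial search."""
    return max(arl(f, u), sll(f, s, d), mii_r)

# Hypothetical loop: node -> class, class -> delay, class -> unit count.
f = {"a": "MEM", "b": "ALU", "c": "MEM", "d": "JMP"}
s = {"MEM": 2, "ALU": 1, "JMP": 1}
u = {"MEM": 1, "ALU": 1, "JMP": 1}
d = {("b", "b"): 1}                    # b depends on itself, one iteration back
print(lower_bound(f, s, u, d, mii_r=7))
```

With these inputs the recurrence term dominates, which mirrors the situation in the example of section 3.5.4 where no extension of the recurrence schedule is needed.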

We have developed an efficient algorithm to construct throughput-optimal schedules for simple loops with general dependency graphs, with the achieved initiation interval $p = \max\{\, \mathrm{ARL}(N), \mathrm{SLL}(S), \mathrm{MII}(R) \,\}$. Before presenting the simple loop scheduling algorithm, we first present a formal specification of the optimization problem that must be solved to construct a throughput-optimal schedule.

3.5.1.  Formulation  of  the  Optimization  Problem 

A program loop is represented by a directed data dependency graph $G = (N, D)$ containing instruction nodes $N = \{1, 2, \ldots, n\}$. Dependencies or edges between instruction nodes are represented by an $n \times n$ dependency matrix $D = [d_{i,j}]$. If two nodes $i$ and $j$ are independent then $d_{i,j} = \infty$. Otherwise $d_{i,j}$ gives the distance in loop iterations between the source and destination of the dependency. If, in the current iteration, node $j$ is dependent on node $i$ also of the current iteration, then the dependency distance is zero iterations so $d_{i,j} = 0$. If node $j$ is dependent on node $i$ of the previous iteration then $d_{i,j} = 1$. If, through subscript or pointer analysis, it is known that node $j$ can only depend on node $i$ of the $k$-th previous iteration, then $d_{i,j} = k$. Note that $d_{i,j} \ge 0$ for all $i \in N$, $j \in N$.

Recall that the functional unit class used by an instruction node $i$ is given by $f_i$ and so the execution delay of that node is given by $s_{f_i}$. Previously we have assumed that there is exactly one resource unit of each particular class. We now generalize that assumption by using $u_c$ to denote the number of functional units of a particular class $c$.

Let $x_i$ be the issue time for node $i$ and let the initiation interval for the schedule be $p$. The problem of constructing a maximum throughput schedule is to assign integer values to $x_i$ such that the initiation interval $p$ is minimized without violating any dependency or resource constraints. The problem of finding a schedule with a minimum initiation interval is formally stated in figure 3.20.

Constraint (1) is necessary to prevent data dependency violations. Suppose $j$ is dependent on $i$ of the same iteration. Then $d_{i,j} = 0$, so the constraint becomes $x_i + s_{f_i} \le x_j$. This inequality simply says that node $i$ must finish execution before node $j$ can start. If $j$ is dependent on $i$ from the previous iteration, then there is an extra latitude of $p$ cycles due to the intervening iteration. This latitude is reflected by the term $-p\,d_{i,j}$. Note that if $j$ does not depend on $i$, then $d_{i,j} = \infty$, so constraint (1) becomes vacuous for the node pair $(i, j)$.

Constraint (2) is necessary to preclude resource usage conflicts. Consider a clock cycle $\tau$, $0 \le \tau < p$. Because successive iterations of a loop are overlapped with a shift of $p$ cycles, a node of iteration $k$ scheduled for time $t$ will occur concurrently with a


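Assign $x_i$, $i \in N$, so as to minimize $p$ subject to the constraints

$$x_i + s_{f_i} - p\,d_{i,j} \le x_j \qquad i \in N,\ j \in N \qquad (1)$$

$$\sum_{i=1}^{n} \delta_i(c, \tau) \le u_c \qquad c \in C,\ 0 \le \tau < p \qquad (2)$$

where

$$\delta_i(c, \tau) = \begin{cases} 1 & \text{if } f_i = c \text{ and } (x_i \bmod p) = \tau \\ 0 & \text{otherwise} \end{cases}$$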


Figure  3.20.  Formulation  of  the  optimal  scheduling  problem. 
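The two constraints of figure 3.20 can be checked mechanically for a candidate schedule. The following is a minimal sketch, assuming a dictionary-based representation in which a missing entry of `d` stands for an infinite dependency distance; the function name `is_valid_schedule` and the two-node example are hypothetical and not part of the original formulation.

```python
from math import inf

def is_valid_schedule(x, p, f, s, u, d):
    """Check constraints (1) and (2) for issue times x and interval p.

    x: node -> issue time, f: node -> class, s: class -> execution delay,
    u: class -> number of units, d: (i, j) -> dependency distance.
    """
    # Constraint (1): x_i + s_{f_i} - p * d_{i,j} <= x_j for dependent pairs.
    for (i, j), dist in d.items():
        if dist != inf and x[i] + s[f[i]] - p * dist > x[j]:
            return False
    # Constraint (2): at most u_c nodes of class c in each modulo time slot.
    used = {}
    for i, t in x.items():
        key = (f[i], t % p)
        used[key] = used.get(key, 0) + 1
        if used[key] > u[f[i]]:
            return False
    return True

# Hypothetical two-node recurrence: b uses a, and a uses b of the previous iteration.
f = {"a": "MEM", "b": "ALU"}
s = {"MEM": 2, "ALU": 1}
u = {"MEM": 1, "ALU": 1}
d = {("a", "b"): 0, ("b", "a"): 1}
print(is_valid_schedule({"a": 0, "b": 2}, 3, f, s, u, d))
print(is_valid_schedule({"a": 0, "b": 2}, 2, f, s, u, d))
```

A checker of this kind does not find the minimum $p$; it only confirms that a proposed assignment of issue times respects both constraints for a given $p$.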


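for every ordered pair of nodes, $(i, j)$. Substitution yields

$$x_i' + (p - p')\left\lfloor \frac{x_i'}{p'} \right\rfloor + s_{f_i} - (p - p' + p')\,d_{i,j} - x_j' - (p - p')\left\lfloor \frac{x_j'}{p'} \right\rfloor \le 0$$

Rearrangement yields

$$\bigl(x_i' + s_{f_i} - p'\,d_{i,j} - x_j'\bigr) + (p - p')\left(\left\lfloor \frac{x_i'}{p'} \right\rfloor - d_{i,j} - \left\lfloor \frac{x_j'}{p'} \right\rfloor\right) \le 0$$

Since the old schedule was a valid schedule, the original node starting times must satisfy data-dependency constraints. Hence

$$x_i' + s_{f_i} - p'\,d_{i,j} - x_j' \le 0$$

Noting that $(p - p') \ge 0$, it follows that the data-dependency constraint for the new schedule is satisfied if

$$\left\lfloor \frac{x_i'}{p'} \right\rfloor \le \left\lfloor \frac{x_j'}{p'} \right\rfloor + d_{i,j} \qquad (3)$$

The validity of the old schedule implies that

$$x_i' \le x_j' + p'\,d_{i,j} - s_{f_i}$$

Since $x_i'$ appears in a nonnegative term in inequality (3), replacing that term with another term that is no smaller yields a tighter constraint. Thus, if

$$\left\lfloor \frac{x_j' + p'\,d_{i,j} - s_{f_i}}{p'} \right\rfloor \le \left\lfloor \frac{x_j'}{p'} \right\rfloor + d_{i,j}$$

is satisfied, then (3) and hence the data-dependency constraint for the new schedule is satisfied. Since $d_{i,j}$ is an integer, it can be canceled to yield

$$\left\lfloor \frac{x_j' - s_{f_i}}{p'} \right\rfloor \le \left\lfloor \frac{x_j'}{p'} \right\rfloor$$

This inequality is satisfied since $s_{f_i} > 0$. Hence Part (a) is proved.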

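Part (b): Prove that the resource utilization constraint is satisfied for the new schedule.

The resource utilization constraint states that

$$\sum_{i=1}^{n} \delta_i(c, \tau) \le u_c$$

for $0 \le \tau < p$ and $c \in C$ where

$$\delta_i(c, \tau) = \begin{cases} 1 & \text{if } f_i = c \text{ and } (x_i \bmod p) = \tau \\ 0 & \text{otherwise} \end{cases}$$

Let $\delta_i'$ refer to the old schedule and $\delta_i$ refer to the new schedule as defined below:

$$\delta_i'(c, \tau) = \begin{cases} 1 & \text{if } f_i = c \text{ and } (x_i' \bmod p') = \tau \\ 0 & \text{otherwise} \end{cases}$$

$$\delta_i(c, \tau) = \begin{cases} 1 & \text{if } f_i = c \text{ and } \Bigl(\bigl(x_i' + (p - p')\lfloor x_i'/p' \rfloor\bigr) \bmod p\Bigr) = \tau \\ 0 & \text{otherwise} \end{cases}$$

Then the resource utilization constraint is satisfied if $\delta_i'(c, \tau) = \delta_i(c, \tau)$, or equivalently if

$$x_i' - p'\left\lfloor \frac{x_i'}{p'} \right\rfloor = x_i' + (p - p')\left\lfloor \frac{x_i'}{p'} \right\rfloor - p\left\lfloor \frac{x_i' + (p - p')\lfloor x_i'/p' \rfloor}{p} \right\rfloor$$

Rearrangement and cancellation yields

$$\left\lfloor \frac{x_i'}{p'} \right\rfloor = \left\lfloor \frac{x_i' + (p - p')\lfloor x_i'/p' \rfloor}{p} \right\rfloor$$

Writing $x_i' = \alpha p' + \beta$ with $\alpha$ an integer and $0 \le \beta < p'$, and using the fact that

$$\left\lfloor \frac{\alpha p' + \beta}{p'} \right\rfloor = \alpha,$$

the equation reduces to

$$\alpha = \left\lfloor \frac{\alpha p' + \beta + (p - p')\alpha}{p} \right\rfloor$$

Simplifying yields

$$\alpha = \left\lfloor \alpha + \frac{\beta}{p} \right\rfloor$$

The equality is satisfied since $\beta < p' \le p$ and $\alpha$ is an integer. Hence Part (b) is proved.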
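The extension step that the proof manipulates can be sketched in a few lines. The sketch assumes the extension formula $x_i = x_i' + (p - p')\lfloor x_i'/p' \rfloor$ as read from the substitution in Part (a), and nonnegative issue times; the function name and the sample schedule are hypothetical.

```python
def extend_schedule(x_old, p_old, p_new):
    """Stretch a schedule from initiation interval p_old to p_new >= p_old.

    Every node keeps its offset within its own window of p_old cycles,
    and each window is widened to p_new cycles.
    """
    if p_new < p_old:
        raise ValueError("the new initiation interval must not be smaller")
    return {i: t + (p_new - p_old) * (t // p_old) for i, t in x_old.items()}

# Hypothetical schedule with interval 6, extended to interval 7.
old = {"a": 0, "b": 2, "c": 4, "g": 7}
new = extend_schedule(old, 6, 7)
print(new)
# Part (b): each node keeps the same row of the modulo reservation table.
print(all(old[i] % 6 == new[i] % 7 for i in old))
```

Because every node stays in the same modulo time slot, rows $p'$ through $p - 1$ of the enlarged reservation table are left empty, and it is these rows that become available to the nodes outside the recurrences.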


The  Simple  Loop  Scheduling  Algorithm 

The ability to extend the initiation interval of an existing schedule to accommodate more nodes allows us to adapt algorithm B to the more general case involving multinode recurrences. Recall that algorithm B requires that the nodes of the acyclic graph be ordered in topological ordering. Topological ordering of the nodes in a cyclic graph is obviously impossible. Instead we use a topological ordering based on the acyclic superstructure graph[32] of the general graph, constructed as follows:

(1) For each multinode recurrence $R_k$ do the following: Delete all dependency arcs whose source and destination are both in $R_k$. Replace the nodes in $R_k$ by a single new node, $r_k$, and connect all remaining dependency arcs to and from nodes in $R_k$ to $r_k$.

(2) Delete all self-looping dependency arcs from the node set $S$. At this point the graph is acyclic.

(3) Topologically order the acyclic graph and let the nodes form a sequence. For each single node, $r_k$, representing a multinode recurrence, $R_k$, do the following: Expand each $r_k$ back into the multinode recurrence $R_k$. Place the nodes in $R_k$ into the sequence so that if $r_k$ was between the nodes $x$ and $y$, then all the nodes in $R_k$ are placed between $x$ and $y$. Reconnect the dependency arcs at $r_k$ to the appropriate individual nodes in $R_k$ as before.

(4)  Restore  the  self-looping  dependency  arcs  from  the  node  set  S.  At  this  point  the 
original  graph  has  been  restored. 

This procedure produces the node ordering

$$N = \{N_0, R_1, N_1, R_2, N_2, \ldots, R_m, N_m\}$$

with the property that a set of nodes, $N_k$, contains the nodes that, in the topological ordering of the acyclic superstructure graph, fell between nodes $r_k$ and $r_{k+1}$. Note that

$$\bigcup_{k=0}^{m} N_k = S \cup A$$
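The four-step ordering procedure can be sketched as follows. This is an illustration only: it assumes that a multinode recurrence corresponds to a set of nodes that can all reach one another along dependency arcs, it finds those sets by plain reachability rather than by any method from the text, and it relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The function names and the sample graph are hypothetical.

```python
from graphlib import TopologicalSorter

def reachable(succ, start):
    """Nodes reachable from start by following one or more arcs."""
    seen, stack = set(), list(succ[start])
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(succ[v])
    return seen

def superstructure_order(nodes, arcs):
    """Order nodes so that every multinode recurrence stays contiguous."""
    succ = {v: set() for v in nodes}
    for i, j in arcs:
        if i != j:                         # step (2): ignore self-loops
            succ[i].add(j)
    reach = {v: reachable(succ, v) for v in nodes}
    # Step (1): collapse mutually reachable nodes onto one representative.
    rep = {v: next(w for w in nodes
                   if w == v or (w in reach[v] and v in reach[w]))
           for v in nodes}
    members = {}
    for v in nodes:
        members.setdefault(rep[v], []).append(v)
    # Predecessor sets of the acyclic superstructure graph.
    preds = {r: set() for r in members}
    for i, j in arcs:
        if rep[i] != rep[j]:
            preds[rep[j]].add(rep[i])
    # Step (3): topological order, expanding each representative in place.
    order = []
    for r in TopologicalSorter(preds).static_order():
        order.extend(members[r])
    return order

# Hypothetical graph: recurrences {a, b, c} and {f, g}, self-loop on d.
nodes = list("abcdefg")
arcs = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("c", "e"),
        ("e", "f"), ("f", "g"), ("g", "f"), ("d", "d")]
print(superstructure_order(nodes, arcs))
```

Step (4) needs no code here because the sketch never modifies the original arc list. The reachability search is quadratic in the number of nodes, which is acceptable for the small loops considered but is not a claim about the method used in the original work.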

The simple loop scheduling (SLS) algorithm, shown in figure 3.21, uses this node ordering. Note that the SLS algorithm finds a maximum steady-state throughput schedule for the loop, but makes no attempt to reduce loop start-up time. A description of the SLS algorithm follows.

(1) Step C2 constructs an optimal throughput schedule $\{x_1', x_2', \ldots\}$ with initiation interval $p'$ for the set of multinode recurrences $R = \{R_1, R_2, \ldots, R_m\}$, using a combinatorial search procedure. This schedule is extended, if necessary, in step C3

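C1.  procedure simple_loop_scheduler (N)
C2.      construct MII schedule for R
C3.      extend schedule such that $p = \max\{\, \mathrm{ARL}(N), \mathrm{SLL}(S), \mathrm{MII}(R) \,\}$
C4.      reserve_time_slots (R)
C5.      assign_issue_times ($N_0$)
C6.      for $k \leftarrow 1$ to $m$ do
C7.          delay_issue_times ($R_k$)
C8.          assign_issue_times ($N_k$)
C9.  end

C10. procedure reserve_time_slots (R)
C11.     for each $c \in C$ do
C12.         for $t \leftarrow 0$ to $p - 1$ do $Y_{t,c} \leftarrow u_c$
C13.         $r_c \leftarrow -1$; $Y_{-1,c} \leftarrow 0$
C14.     for each $i \in R$ do $Y_{x_i \bmod p,\, f_i} \leftarrow Y_{x_i \bmod p,\, f_i} - 1$
C15. end

C16. procedure assign_issue_times ($N_k$)
C17.     for each $j \in N_k$ do
C18.         $x_j \leftarrow 0$
C19.         for each $i \in pred(j)$ do $x_j \leftarrow \max\{\, x_j,\ x_i + s_{f_i} \,\}$
C20.         while $Y_{r_{f_j},\, f_j} = 0$ do $r_{f_j} \leftarrow r_{f_j} + 1$
C21.         adjust $x_j$ to fall in time slot $r_{f_j}$
C22.         reserve the resource: $Y_{r_{f_j},\, f_j} \leftarrow Y_{r_{f_j},\, f_j} - 1$
C23. end

C24. procedure delay_issue_times ($R_k$)
C25.     $d \leftarrow 0$
C26.     for each $j \in R_k$ do
C27.         for each $i \in pred(j) - R_k$ do $d \leftarrow \max\{\, d,\ x_i + s_{f_i} - x_j \,\}$
C28.     for each $j \in R_k$ do $x_j \leftarrow x_j + p \lceil d/p \rceil$
C29. end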


Figure  3.21.  Algorithm  C:  simple  loop  scheduling  algorithm. 


to accommodate the resource requirements of other nodes in the graph using the formula $x_i = x_i' + (p - p')\lfloor x_i'/p' \rfloor$ previously presented.


(2) Step C4 calls a procedure to reserve the resource time slots used by the schedule for the multinode recurrences. Steps C11-C13 initialize counters, $r_c$, and the two-dimensional array, $Y$, which represents the modulo reservation table. Each row in $Y$ represents a time slot. Each column in $Y$ represents a class of resource. An entry $Y_{t,c}$ gives the number of free resources of class $c$ at modulo time $t$. Initially, all entries in $Y$ are set to the number of available resources of the appropriate class, $u_c$. In addition, a dummy row, $-1$, has been added to $Y$ to simplify the handling of the $r_c$ counters. Step C14 reserves the resources needed by the multinode schedule by decrementing entries in $Y$.

(3) Step C5 calls a procedure to assign the issue times for the first set of nodes not involved in multinode recurrences. This procedure is essentially the same as algorithm B, with the exception that multiple resources of the same class are allowed. Steps C18-C19 find the earliest issue time for node $j$. Step C20 finds a time slot that has a free resource of the appropriate class. Step C21 adjusts the issue time $x_j$ to fall in that time slot. The resource is reserved in step C22. Note that by selecting $p$ to be no less than ARL, it is guaranteed that there will be enough time slots to accommodate all the nodes.

(4) Step C7 calls a procedure to adjust the issue times of nodes in a multinode recurrence. This adjustment is necessary since the multinode recurrence schedule found in step C2 does not take into account dependencies between nodes in a multinode recurrence and other nodes. Steps C26-C27 find the minimum delay that must be added to the issue time of each node in $R_k$ to satisfy dependency constraints from previously scheduled nodes. This minimum delay is rounded up to the next multiple of $p$ and added to the issue time of each node in $R_k$ by step C28. Note that dependencies between nodes in $R_k$ are guaranteed to be satisfied if every node in $R_k$ is delayed by the same amount. Also note that, because of the node ordering, nodes in $R_k$ can only depend on nodes previously scheduled and other nodes in $R_k$. Thus the minimum delay calculation in step C27 checks only those predecessors of $j$ that are not members of $R_k$. These predecessors are given by the set $pred(j) - R_k$.

(5)  Step  C8  calls  a  procedure,  previously  described,  to  assign  the  next  set  of  nodes  not 
involved  in  multinode  recurrences. 
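Steps C4 through C8 can be sketched in Python as follows. This is an illustration under stated assumptions, not a transcription of algorithm C: the recurrence schedule from steps C2 and C3 is taken as an input, the dummy row $-1$ is replaced by counters that start at slot 0, the slot adjustment of step C21 is written as `earliest + (slot - earliest) % p`, and all names, data layouts, and the sample loop are hypothetical. As in the text, $p$ is assumed to be no less than ARL, otherwise the slot search runs off the end of the table.

```python
def reserve_time_slots(x, rec_nodes, f, u, p):
    """C10-C15: build the modulo reservation table Y and the counters r."""
    Y = {c: [u[c]] * p for c in u}     # Y[c][t] = free units of class c in slot t
    r = {c: 0 for c in u}              # next slot to try for each class
    for i in rec_nodes:
        Y[f[i]][x[i] % p] -= 1
    return Y, r

def assign_issue_times(group, x, pred, f, s, Y, r, p):
    """C16-C23: place the nodes of one set N_k, in the given order."""
    for j in group:
        earliest = max((x[i] + s[f[i]] for i in pred.get(j, ())), default=0)
        c = f[j]
        while Y[c][r[c]] == 0:         # scan forward to a slot with a free unit
            r[c] += 1
        x[j] = earliest + (r[c] - earliest) % p   # first time >= earliest in slot
        Y[c][r[c]] -= 1

def delay_issue_times(rec, x, pred, f, s, p):
    """C24-C29: shift a whole recurrence by a multiple of p."""
    members, d = set(rec), 0
    for j in rec:
        for i in pred.get(j, ()):
            if i not in members:
                d = max(d, x[i] + s[f[i]] - x[j])
    shift = p * -(-d // p)             # d rounded up to a multiple of p
    for j in rec:
        x[j] += shift

def simple_loop_scheduler(groups, recs, x_rec, pred, f, s, u, p):
    """C4-C9, with groups = [N_0, ..., N_m] and recs = [R_1, ..., R_m]."""
    x = dict(x_rec)
    Y, r = reserve_time_slots(x, [i for rec in recs for i in rec], f, u, p)
    assign_issue_times(groups[0], x, pred, f, s, Y, r, p)
    for k, rec in enumerate(recs, start=1):
        delay_issue_times(rec, x, pred, f, s, p)
        assign_issue_times(groups[k], x, pred, f, s, Y, r, p)
    return x

# Hypothetical loop: recurrence {a, b} already scheduled with p = 3.
f = {"a": "MEM", "b": "ALU", "c": "MEM", "d": "ALU"}
s = {"MEM": 2, "ALU": 1}
u = {"MEM": 1, "ALU": 1}
pred = {"a": ["b"], "b": ["a"], "c": ["b"], "d": ["c"]}
print(simple_loop_scheduler([[], ["c", "d"]], [["a", "b"]],
                            {"a": 0, "b": 2}, pred, f, s, u, 3))
```

The per-class counters only move forward, which is what keeps the slot search linear in $p$ over the whole run rather than per node.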

3.5.4.  Example  of  Schedule  Generation  by  the  SLS  Algorithm 

The operation of the SLS algorithm (algorithm C) is illustrated below, using the example graph from figure 3.4, with

$$u_{MEM} = u_{ALU} = u_{JMP} = 1$$

Note that the alphabetic node ordering has been chosen to conform to the ordering required by algorithm B. Therefore $N_0 = \emptyset$, $R_1 = \{a, b, c\}$, $N_1 = \{d, e\}$, $R_2 = \{f, g, h, i\}$, and $N_2 = \{j, k, l\}$.

(1) The MII schedule for the two multinode recurrences $R_1$ and $R_2$ is shown in figure 3.22. Recall that the memory pipeline is two stages long and the ALU pipeline is one stage long. The initiation interval $p = 7$ is already large enough to accommodate ARL and SLL, hence no extension is necessary.

(2) Since $N_0$ is empty, the first step is to process $R_1$. However, nodes in $R_1$ have no predecessor nodes outside of $R_1$, so the delay $d = 0$.

Figure 3.22. MII schedule for multinode recurrences.

(3) The next step is to schedule nodes in $N_1$. There are two nodes, $d$ and $e$, in $N_1$ and they are both dependent on node $c$. Hence the earliest issue time for both $d$ and $e$ is 6. The highest empty slot in the MEM column is row 3, so $e$ is assigned at time 10. The highest empty slot in the JMP column is row 0, so $d$ is assigned at time 7.

(4) The delay for $R_2$ is computed next. Node $f$ from $R_2$ is dependent on node $e$. Since node $e$ was issued at time 10 and the memory pipeline is two stages long, node $f$ must not be issued until time 12. To avoid causing resource utilization conflicts the delay must be adjusted to the next multiple of $p$. Therefore the issue time of each node in $R_2$ must be delayed by 14 as shown in figure 3.23.

(5) Finally, nodes from $N_2$ are inserted as shown in figure 3.24. In this figure, superscripts have been added to indicate relative iterations. As shown by the superscripts, this example schedule contains four overlapped iterations. This schedule produced by the SLS algorithm is the same as the example shown in figure 3.9.
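The arithmetic behind steps (3) and (4) of this example can be written out explicitly. The slot adjustment is expressed here as "earliest time plus the modular distance to the chosen row", which is one assumed way of writing step C21, and the delay of 12 assumes that node $f$ sits at time 0 in figure 3.22, which is not reproduced here.

```latex
\begin{align*}
x_d &= 6 + \bigl((0 - 6) \bmod 7\bigr) = 6 + 1 = 7, \\
x_e &= 6 + \bigl((3 - 6) \bmod 7\bigr) = 6 + 4 = 10, \\
x_e + s_{\mathrm{MEM}} &= 10 + 2 = 12, \\
p \left\lceil \tfrac{12}{p} \right\rceil &= 7 \cdot 2 = 14 \qquad (p = 7).
\end{align*}
```

Rounding the delay up to a multiple of $p$ leaves every node of $R_2$ in the row of the modulo reservation table that was reserved for it in step C4.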


Figure 3.24. Complete optimal throughput schedule.


3.5.5. Summary

The running time of the SLS algorithm can be derived as follows: Step C2 uses a combinatorial search method, so in the worst case this step requires exponential time. However, the combinatorial search operates only on the set $R$. We shall use the notation $|R|$ to denote the size of the set $R$ and $NP(|R|)$ to indicate exponential complexity over the set $R$.

Assuming that the size of the set $C$ is constant, the "reserve_time_slots" procedure has complexity $O(p + |R|)$. However, since $|R| \le n$, the complexity can be expressed as $O(n) + O(p)$.

The "assign_issue_times" procedure is similar to algorithm B and has complexity $O(|N_k|)$. Assuming that the number of predecessors per node is bounded by a constant, the "delay_issue_times" procedure has complexity $O(|R_k|)$. Since

$$\sum_{k=0}^{m} |N_k| + \sum_{k=1}^{m} |R_k| = n,$$

the total complexity of steps C5-C8 is $O(n)$.

Taken together the complexity of algorithm C is

$$NP(|R|) + O(n) + O(p)$$

This complexity indicates that our algorithm is very efficient if either (i) the number of nodes in multinode recurrences is small, or (ii) the combinatorial search algorithm for scheduling multinode recurrences is efficient. In conventional job loads, simple loops are usually small; larger loops almost always involve nested conditional statements and hence cannot be processed by SLS. The number of nodes in multinode recurrences in a small loop is of course also small. Therefore the contribution of the $NP(|R|)$ term to the complexity measure should not preclude algorithm C from being used in practice.

Aside from the $NP$ term, the remaining complexity, $O(n) + O(p)$, is optimal because (i) every node must be visited at least once to generate code, and (ii) every instruction cycle (every row in the MRT) must also be visited at least once. Since there are $O(n)$ nodes and $O(p)$ instruction cycles, the complexity of $O(n) + O(p)$ is clearly minimal.


CHAPTER  4 


MACHINE  ORGANIZATION  AND  CODE  GENERATION  ISSUES 


4.1.  Introduction 

The  study  of  compiler  code  generation  techniques  is  necessarily  dependent  on  the 
choice  of  target  machine  model.  One  of  the  most  important  considerations  in  choosing  a 
target machine model is the level of abstraction. A highly abstract model is advantageous
in that the scheduling algorithms developed for such models are unencumbered
by  implementation  details.  This  may  lead  to  clean  theoretical  results  which  give 
insights  to  the  solution  of  global  problems.  Unfortunately  some  of  the  implementation 
details ignored by a highly abstract model may turn out to be critically important constraints
or efficiently exploitable architectural features. In this thesis we have chosen to
develop  scheduling  algorithms  based  on  a  fairly  detailed  machine  model  in  order  to 
explore  the  relationship  between  machine  organization,  instruction  set  architecture,  and 
compiler  code  scheduling  techniques.  This  chapter  describes  the  target  machine  model, 
discusses  implementation  considerations  that  motivated  the  machine  organization,  and 
develops  solutions  to  the  practical  code  generation  problems  of  register  assignment  and 
branch  handling. 

The proposed target machine is a Tightly-coupled Heterogeneous Multiprocessor (THUMPER). This type of machine is characterized by the following attributes:

Tightly-coupled

System synchronization is provided by a single system-wide clock to achieve low-overhead interprocessor synchronization. Individual processors are interconnected by a high-bandwidth low-delay network to provide high-speed interprocessor communication.


Heterogeneous  multiprocessor 

A  high  degree  of  concurrency  is  provided  through  multiple  processors,  each  of 
which  may  be  pipelined  to  increase  performance  further.  Improved  cost- 
effectiveness  is  attained  through  the  capability  to  mix  identically  replicated 
general-purpose VLSI processors and heavily pipelined special-purpose functional
units,  and  by  the  capability  to  parameterize  both  the  size  of  the  multiprocessor 
system  as  well  as  the  composition  of  the  processors. 

Figure 4.1 shows an example configuration of a THUMPER with three processors: an integer arithmetic processor, a floating-point arithmetic processor, and a memory access processor. The processors are interconnected by a crossbar network with embedded storage capability at each crosspoint[35]. In the following sections we discuss some of the design considerations that led to this machine organization.

Figure 4.1. Block diagram of a THUMPER configuration.

4.2.  Processors  and  Memory  System  Design 

The design of processors is influenced by two conflicting requirements: low cost and high performance. In a multiprocessor organization, low cost can be achieved by replicating identical multifunction processors, particularly using VLSI technology. On the other hand, to achieve very high performance it is usually more cost effective to use a suite of heavily pipelined specialized functional units. A heterogeneous multiprocessor organization captures the advantage of both by incorporating distinct classes of processors. The organization of each class of processors can be optimized to achieve maximum cost-performance. Additional parallelism can be provided through replication.

In our study of code generation techniques we have found that certain constraints on processor designs significantly simplify and/or improve the running speed of scheduling algorithms. Here we distinguish between explicitly-scheduled resources and implicitly-scheduled resources. An explicitly-scheduled resource is a resource for which contention may occur. An implicitly-scheduled resource is a resource whose availability is guaranteed provided that its associated explicitly-scheduled resource was available at some prior time. For example, the first stage of a linearly pipelined functional unit shared by multiple processors is an explicitly-scheduled resource while all the subsequent stages are implicitly-scheduled resources, since the availability of the first stage guarantees the availability of all subsequent stages at the proper later time. From the point of view of code generation, explicitly-scheduled resources are the only ones that need be considered, and we shall use the term resource to mean explicitly-scheduled resources unless otherwise noted.


A most important consideration is the number of resources and the time lag between uses of those resources for each individual instruction. All such resources and their relative time slots must be considered when scheduling each instruction. The simplest case is when every instruction uses exactly one (explicit) resource. In this case deciding whether a particular instruction can be scheduled for execution requires examining only a single variable at a single instant in time. If an instruction uses multiple resources during different phases of its execution, then it becomes necessary to examine multiple variables at different times to decide when an instruction can be issued without subsequent conflict with other instructions. The need to check resources at different times causes difficulty during the scheduling of loops because decisions made at the beginning of the loop must be sensitive to conditions at the end of the loop (of the previous iteration) which are yet unknown.
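The difference between the two cases can be made concrete with a small sketch of a conflict check against a modulo reservation table. The table layout, the `(class, offset)` usage list, and the resource names are hypothetical; the point is only that a one-resource instruction needs a single lookup, while a multi-resource instruction needs one lookup per resource, each at a different time.

```python
def can_issue(table, t, usage, p):
    """True if every resource the instruction needs is free when it needs it.

    table[c][slot] holds the number of free units of class c in a modulo slot;
    usage lists (class, offset) pairs, the offset counted from the issue time t.
    """
    return all(table[c][(t + off) % p] > 0 for c, off in usage)

# Hypothetical table for an initiation interval of three cycles.
p = 3
table = {"ADD": [1, 0, 1], "BUS": [1, 1, 0]}

single = [("ADD", 0)]                  # one explicitly-scheduled resource
multi = [("ADD", 0), ("BUS", 2)]       # adder now, result bus two cycles later

print(can_issue(table, 0, single, p))  # one lookup decides
print(can_issue(table, 0, multi, p))   # blocked by the bus slot two cycles later
print(can_issue(table, 2, multi, p))   # the bus lookup wraps to the next iteration
```

In the last call the bus lookup wraps around the table, which is the loop-scheduling difficulty described above: the decision depends on a slot that belongs to the neighbouring iteration.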

In our target machine we have decided to allow only one explicitly-scheduled resource per instruction and require that each resource be capable of accepting a new instruction per clock. This means that if, for an instruction such as $z = x + y$, the explicitly-scheduled resource is the adder, then the register file containing $x$, $y$, and $z$ as well as all the interconnecting busses must all be implicitly-scheduled resources. In other words, dedicated register file ports and busses must be associated with the adder. Furthermore, the adder itself must be a simple pipeline with no shared or looping stages. For multifunctional processors, one resource per instruction implies that every function must take the same amount of time to flow through the pipeline, otherwise the output bus becomes another explicitly-scheduled resource.

Another important consideration is the number of processor classes, and their characteristics, that can execute a particular instruction. From a hardware utilization point of view it may be desirable to provide both a fast floating-point adder as well as slower microcoded floating-point addition on a multifunctional processor. Unfortunately, to utilize such a system fully the compiler must, for each floating-point add instruction, choose between using the fast adder and possibly delaying another more critical add instruction or using the slower multifunctional processor and possibly delaying several more critical instructions of as yet unknown classes. Choices of this type are called global choices since one decision may impact other choices in the future. In contrast, choosing among several dedicated adders of identical delay is called a local choice since this decision has no effect on choosing a schedule time for other instructions.

To generate high quality code, a compiler must occasionally backtrack and reconsider earlier decisions. However, only global choices must be reexamined. Hence, to reduce compilation time, it is desirable to minimize the number of global choices that must be made per instruction. In our target machine we have decided to partition the processors into classes and bind every instruction to a single processor class. Processors of a given class are therefore identical and hence it is a local choice to decide which one to use. With this constraint the only global choice is to decide when to execute a particular instruction.

Determinacy of execution time is another important consideration. Resolution of conflicting resource requirements at compilation time is highly desirable to avoid the cost, in both hardware and run time, needed for arbitration and synchronization. To achieve this resolution, the compiler must be able to predict the exact execution time of every instruction. Because we already restrict processors to be linear pipelines, the execution time of most instructions is completely deterministic. The time for a memory reference, however, cannot be made deterministic because of memory bank conflicts and/or concurrent input/output operations. We have chosen to model the memory system as a set of pipelines whose length is equal to the memory reference time in clock cycles. When bank conflicts occur, the entire machine is frozen until the conflicts are resolved.

4.3. Storage-Enhanced Crossbar Interconnect Design

As shown in figure 4.1, the heterogeneous processors and memory pipelines are interconnected by a crossbar with embedded storage at each crosspoint. The storage-enhanced crossbar interconnect was chosen because it simplifies and/or solves a number of difficult implementation and code-scheduling issues. The concept of a storage-enhanced crossbar interconnect is not new, and many of its advantages have been documented[35]. We now briefly review some of these advantages.

In  the  previous  section  we  discussed  the  importance  of  minimizing  the  number  of 
explicitly-scheduled  resources.  The  crossbar  is  the  only  interconnection  network  that 
provides  dedicated  busses  for  every  processor  and  memory  and  a  dedicated  switch  for 
each  possible  connection.  Therefore  all  the  busses  and  switches  are  implicitly-scheduled 
resources. Other interconnection schemes using shared busses and/or switches necessarily
introduce additional explicitly-scheduled resources.

A  similar  line  of  reasoning  leads  to  the  decision  to  embed  the  register  file  within 
the  crossbar  interconnect.  In  order  to  avoid  generation  of  additional  explicitly- 
scheduled  resources,  a  register  file  port  with  no  access  conflict  must  be  dedicated  to  each 
processor port. As the size of a multiprocessor system increases, it becomes impractical
to  implement  a  single  centralized  multiported  register  file  with  the  required  number  of 
conflict-free  ports.  One  solution  is  to  decentralize  the  register  file  by  distributing  the 
file  memory  into  each  crosspoint  of  a  crossbar  interconnect. 

Referring to the crossbar in figure 4.1, the register read data busses are shown vertically while the register write data busses are shown horizontally. High write bandwidth is provided by partitioning the register file and associating a distinct partition with the output port of each processor and memory pipeline. In effect, each output port is connected to a separate file memory so that every processor and memory can simultaneously write into the register file without conflict. High read bandwidth is provided by replicating the data within the register file once for each processor and memory read port. This design allows an arbitrarily large storage-enhanced crossbar to be constructed using two-ported random access memories (RAM).

Replicating data to increase register file read bandwidth increases the cost of the
system but poses no problem for the compiler since the replications can be made
architecturally transparent. Partitioning the register file to increase write bandwidth,
however, cannot be made architecturally transparent and hence has a strong impact on code
generation. For the machine configuration shown in figure 4.1, the architectural view of
the register file is shown in figure 4.2, assuming for this example that the RAM at each
crosspoint contains only four words. Each processor can only write into those registers
that belong to the partition corresponding to the row in the crossbar connected to that
processor's write bus. However, since data is replicated across all register modules in a
row, every processor can read registers belonging to all partitions. Although this
distributed register file architecture has implementation merits, it does introduce additional
code generation problems. We shall return to this issue in section 4.6.
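The read and write rules described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the class name, method names, and the three-port configuration are hypothetical, and the model ignores timing entirely. Each crosspoint RAM is indexed by the writer's row and the reader's column.

```python
class DistributedRegisterFile:
    """Toy model of a register file distributed over crossbar crosspoints."""

    def __init__(self, num_ports, words_per_ram=4):
        self.num_ports = num_ports
        self.words_per_ram = words_per_ram
        # ram[row][col] is the RAM at one crosspoint:
        # row = partition of the writing port, col = reading port.
        self.ram = [[[0] * words_per_ram for _ in range(num_ports)]
                    for _ in range(num_ports)]

    def write(self, writer, reg, value):
        # A port can only write its own partition (its crossbar row);
        # the value is replicated into every RAM in that row.
        for col in range(self.num_ports):
            self.ram[writer][col][reg] = value

    def read(self, reader, partition, reg):
        # A port reads through its own column and may name any partition.
        return self.ram[partition][reader][reg]


rf = DistributedRegisterFile(num_ports=3)
rf.write(writer=0, reg=1, value=42)  # port 0 writes register 1 of partition 0
rf.write(writer=2, reg=1, value=7)   # port 2 writes register 1 of partition 2
values = [rf.read(reader=r, partition=0, reg=1) for r in range(3)]
```

Note that `write` takes no partition argument: in this model the partition is implied by the writing port, which mirrors the restriction that a processor can write only into its own row.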

Another  reason  for  embedding  the  register  file  within  the  crossbar  interconnect  is 
to  mitigate  the  high  cost  of  the  crossbar  network  in  terms  of  chip  count.  Here  we 
assume  that  the  physical  size  of  a  chip  package  is  determined  by  the  number  of  pins  and 
not  by  the  amount  of  logic  contained  within  the  chip.  Given  that  a  large  number  of 
pins  are  needed  at  each  crosspoint  to  interconnect  the  two  orthogonal  word-wide  data 
busses,  the  addition  of  a  small  RAM  to  each  simple  crosspoint  switch  should  not 


4.4.  Control  Unit  Design


The  THUMPER  is  controlled  by  a  single  control  unit  and  synchronized  by  a  single 
system-wide  clock.  This  approach  has  a  number  of  advantages.  Using  a  centralized 
clock  and  a  global  control  unit  leads  to  a  highly  deterministic  system  whose  detailed 
run  time  behavior  can  be  accurately  determined  at  compile  time.  The  compiler  can 
optimize  the  code  by  knowing  the  actual  behavior  of  the  machine,  instead  of  knowing 
only  the  statistical  behavior.  Another  advantage  of  using  a  centralized  control  unit  for 
a multiprocessor is the elimination of run-time arbitration and synchronization
overhead for interprocessor communication.

The THUMPER uses a wide horizontal instruction format as shown in figure 4.3.
This instruction format is very similar to that of horizontal microcode, hence
architectures based on this type of synchronous multiprocessor organization are also called
horizontal architectures [9, 30, 31]. A separate field is allocated to each processor, plus an
additional field for branch specification or an immediate constant. Each processor field
contains an opcode specific to that processor class. Two register specifier fields are used
to address input operands resident in the register file. Another register specifier field is
used to address the result operand. Note that the input register specifier fields are large
enough to address every partition in the register file while the output register specifier
field is large enough to address only the one partition that is writable by that processor.

Figure 4.3. Horizontal instruction format.
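The difference in size between the input and output register specifier fields can be made concrete with a short calculation. The sketch below assumes a dense binary encoding of register addresses, which the text does not specify; the function name and the four-partition configuration are hypothetical, while the four words per partition follow the example of figure 4.2.

```python
def specifier_widths(num_partitions, words_per_partition):
    """Return (input_bits, output_bits) for the register specifier fields."""

    def bits(n):
        # Bits needed to distinguish n items: ceil(log2(n)), and 0 when n == 1.
        return (n - 1).bit_length()

    # An input specifier must be able to name any register in any partition.
    input_bits = bits(num_partitions * words_per_partition)
    # An output specifier only names a register inside the writer's partition.
    output_bits = bits(words_per_partition)
    return input_bits, output_bits


in_bits, out_bits = specifier_widths(num_partitions=4, words_per_partition=4)
```

Under these assumptions each processor field carries two wide input specifiers and one narrow output specifier.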

We  chose  to  use  an  almost  purely  horizontal  instruction  format  rather  than  a 
more  vertical  format  because  we  have  found  that  the  flexibility  offered  by  horizontal 
instruction  formats  is  essential  for  the  exploitation  of  parallelism  in  a  wide  range  of 
application programs. As discussed in chapter 3, more highly encoded vertical instruction
formats such as those generally employed by vector architectures are unable to
exploit  much  of  the  parallelism  available  in  those  program  loops  that  involve  multinode 
recurrences.  Vertical  instruction  formats  are  also  unsuitable  for  the  exploitation  of 
parallelism  available  in  scalar  program  fragments. 

The control unit is organized as a linear pipeline whose length is equal to the
program memory access time plus the time needed to decode the instruction. The program
memory is interleaved to supply one instruction per clock cycle. It is further
interleaved so that most of the time a branch to an arbitrary bank will experience little or
no  delay.  When  a  memory  bank  conflict  does  occur,  the  simplest  approach  is  to  stop  the 
processor  until  the  conflict  is  resolved. 

We have chosen not to include an instruction cache in our proposed target
machine. To be fast, caches must be constructed using a relatively expensive
technology, and the physical size of the cache must be kept small in order to reduce cost and
minimize the physical separation between the cache and the processor [36]. However, a
major reason for incorporating an instruction cache in a machine is to reduce the time
needed for a taken conditional branch. Therefore rather than using an instruction
cache, we have chosen to rely on compile-time code scheduling technology to minimize
the performance impact of long branch time.

A linear instruction fetch and decode pipeline lends itself naturally to an
architecture with delayed branches [5, 6, 7]. Our scheduling techniques are designed to take full
advantage  of  relatively  long  delay  branches,  under  the  assumption  that  to  achieve  high 
clock  speed  it  is  necessary  to  partition  the  instruction  fetch  and  decode  pipeline  into 
multiple  segments  with  fine  granularity.  The  DTS  technique  performs  extensive  code 
rearrangement  to  allow  a  sequence  of  delayed  branches  to  be  overlapped,  thus  reducing 
the  average  delay  of  conditional  branches  in  scalar  code. 

4.5.  Machine  Parameters

We  have  described  an  expandable  multiprocessor  organization  and  discussed  the 
rationale  behind  some  of  the  design  decisions.  The  significance  of  this  organization  is 
that  it  can  be  efficiently  implemented  using  current  technology  and  it  can  be  completely 
characterized  by  a  small  number  of  parameters.  The  ability  to  capture  concisely  all  the 
constraints imposed by the machine organization has a direct impact on the development
of scheduling techniques, both in simplifying the algorithms as well as in improving
the efficiency of these algorithms. This section describes each of the machine
parameters.

The universe of instruction opcodes is given by the set F. This set of opcodes
defines the functionality of the instruction set architecture. Elements of F include the
usual integer and floating point arithmetic operations, logical operations, memory
operations, etc. The exact membership of F is a relatively low-level design issue, and is
beyond the scope of this thesis. We do, however, require that F include the guarded
store and guarded jump instructions described in chapter 2.


The set C defines the processor classes. An element of C can be a multifunctional
processor, such as an integer unit that can perform all the normal arithmetic and logical
functions (e.g. an ALU). Elements of C can also be unifunctional processors, such as
specialized floating point add and multiply pipelines. Since we require a disjoint
partition of functionality for different processor classes, a function $f$ can be defined to map
each instruction opcode i onto one particular processor class.

Multiple processors of the same class can be incorporated for increased
parallelism. However we require that all processors belonging to a particular class be
functionally identical. The number of replicated processing units of a particular class c is given
by $u_c$.

Since the organization of each processor is constrained to be a linear pipeline and
since every instruction is constrained to flow through every pipeline stage, the temporal
characteristics of a processor are completely specified by the number of pipeline stages.
This number is given by $s_c$, where c is a processor class. Note that by modeling the
memory system and the instruction fetch and decode process as linear pipelines, we can
define a "memory" processor class and a "branch" processor class to model the
scheduling constraints imposed by the operation of these resources. The parameters for these
processor classes are exactly the same as for any other processor class, namely $s_c$ and $u_c$.

Pipelining within the storage-enhanced crossbar interconnect can be handled
simply by treating the interconnect pipeline as an extension of the processor pipeline, since
the resources within the crossbar are all implicitly-scheduled resources. Therefore the
additional delay within the interconnect can be charged to $s_c$.

The parameters $f$, $u_c$, and $s_c$ are sufficient for describing the processing part of a
THUMPER implementation. The register file is characterized by the size of the RAM
within each crosspoint cell. Note that there is no need to specify the number of
segments in the register file since that is implicitly specified by the number of processors,
which is equal to $\sum_{c \in C} u_c$.
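To show how compactly these parameters describe a configuration, the following sketch collects them in a single record. All names and the example values are hypothetical; only the roles of the parameters (the opcode-to-class map, the unit count per class, the stage count per class, and the crosspoint RAM size) come from the description above.

```python
from dataclasses import dataclass


@dataclass
class MachineModel:
    opcode_class: dict         # the map f: opcode -> processor class
    units: dict                # u_c: processor class -> number of identical units
    stages: dict               # s_c: processor class -> number of pipeline stages
    words_per_crosspoint: int  # size of the RAM in each crosspoint cell

    def check(self):
        # Both per-class parameters must be defined for exactly the classes in C,
        # and every opcode must map onto one of those classes.
        classes = set(self.units)
        if classes != set(self.stages):
            raise ValueError("units and stages must cover the same classes")
        if not set(self.opcode_class.values()) <= classes:
            raise ValueError("an opcode maps to an unknown processor class")

    def num_partitions(self):
        # The number of register file partitions equals the number of
        # processors, i.e. the sum of u_c over all classes c in C.
        return sum(self.units.values())


# Illustrative configuration only; these values are not taken from the thesis.
model = MachineModel(
    opcode_class={"add": "ALU", "load": "MEM", "store": "MEM", "jump": "JMP"},
    units={"ALU": 2, "MEM": 1, "JMP": 1},
    stages={"ALU": 2, "MEM": 4, "JMP": 3},
    words_per_crosspoint=4,
)
model.check()
partitions = model.num_partitions()
```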


4.6.  Register  Assignment  Issues 

As we alluded to in section 4.3, the distributed register file introduces certain code
generation problems that do not arise in a conventional centralized register file.
Referring to figure 4.2, the problem with this register file architecture is that to fetch a value
it is necessary to know which partition the value is in, i.e., to know which processor
generated that value. Sometimes, however, it is impossible to know at compile time
which processor will generate a particular value, as shown in the following example.

    if (...) x = A[t];
    else x = b * c;
    z = x + y;

If memory fetches are handled by one processor while arithmetic operations are handled
by another processor, the value x must reside in different partitions depending on the
outcome of the if-statement. This uncertainty causes problems for the compiler when
it tries to generate code for z = x + y since the location of x cannot in general be
determined at compile time. Note, however, that this uncertainty can only occur when
the basic block containing z = x + y has two predecessor blocks.


To  solve  this  problem  we  have  elected  to  constrain  the  compiler  to  use  registers 
only  for  temporaries  within  a  tree  of  basic  blocks.  Because  each  basic  block  within  a 
tree  (except  the  root)  has  exactly  one  predecessor  block,  it  is  always  possible  to  identify 
uniquely the register file partition that a temporary value resides in. During tree
transitions all temporary values must be stored in memory. Therefore the fact that there
are  multiple  predecessor  blocks  branching  to  the  root  node  of  a  tree  does  not  cause  any 
problems. 
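The constraint can be illustrated by locating the blocks at which a new tree must begin. The sketch below is an assumption-laden illustration, not the compiler's actual procedure: it treats the entry block and every block whose predecessor count is not exactly one as a tree root, and it ignores the code replication discussed in chapter 2. The function name and block names are hypothetical.

```python
def tree_roots(successors, entry):
    """Return the set of basic blocks that must start a new tree.

    successors maps each block name to the list of blocks it can branch to.
    """
    pred_count = {block: 0 for block in successors}
    for succs in successors.values():
        for s in succs:
            pred_count[s] = pred_count.get(s, 0) + 1
    # Interior and exterior blocks of a tree have exactly one predecessor;
    # anything else (and the program entry) has to be a root.
    roots = {entry}
    roots.update(block for block, n in pred_count.items() if n != 1)
    return roots


# Control flow of the example: the if-statement, its two arms, and the join
# block that contains z = x + y.
cfg = {
    "test": ["then", "else"],
    "then": ["join"],
    "else": ["join"],
    "join": [],
}
roots = tree_roots(cfg, entry="test")
```

Because the join block comes out as a root, x crosses a tree transition and would be passed through memory, so its partition never needs to be known when compiling the join block.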

The scheduling technique proposed in chapter 2 directly implements this idea by
representing programs as decision trees, hence register assignment with the DTS
technique is straightforward. The simple loop scheduling technique proposed in chapter 3
can also implement this idea because an unrolled simple loop forms a highly skewed
decision tree. A method of register assignment for the SLS technique is discussed
below.


The example optimal throughput schedule produced by the SLS algorithm in
figure 3.24 has length l = 4. At any one time there are up to four iterations being
executed concurrently. Therefore each temporary value name shown in figure 3.24
requires four physical registers to accommodate the four distinct values that exist
concurrently. Since each instruction operand specifier must reference four different
registers at different times, there is a problem with name binding.


An innovative hardware addressing scheme to solve this name binding problem
has been proposed by Rau [35]. This approach uses hardware queues with the capability
of deleting any element within the queue. Queues are used at each crosspoint of the
crossbar to allow relative register addressing, thus implementing run-time dynamic
name binding. Although elegant, the use of hardware queues with random deletion
capability rather than RAM to implement the distributed register file significantly
increases the complexity of the system and can lead to additional delays in transferring
data through the register file. Thus this approach has a negative impact on both
performance and cost-effectiveness.


We advocate a much simpler approach that uses additional program storage space
to solve the name binding problem statically. Our solution involves unrolling the loop
l times, where l is the number of overlapped iterations. Figure 4.4 shows one iteration
of the schedule from figure 3.24. Each instruction is shown in detail to illustrate
register assignments. The JMP processor class instructions have been omitted since they are
not germane to this discussion.

The register names shown in figure 4.4 have been given subscripts to distinguish
among the four physical registers. Note that instructions a, f, and k reference
registers with subscript 3, indicating that these values are to come from previous iterations.
The schedule shown in figure 4.4 represents the first of the four overlapped iterations.
The schedule for the remaining three iterations can be derived by successively

(i) rotating the original schedule by some multiple of p clock cycles, where p is the
initiation interval, and

(ii) incrementing the register subscripts by one, modulo l.

Figure  4.5  shows  the  complete  schedule  with  four  overlapped  iterations.  Note  that 
every  instruction  now  has  a  unique  symbolic  register  address  that  can  be  mapped  into  a 
unique  physical  register  address.  Therefore  the  register  file  can  be  implemented  using 
an  ordinary  RAM. 
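Steps (i) and (ii) can be sketched as a small transformation on a one-iteration schedule. The data layout, the function name, and the two-instruction example are hypothetical and are not the schedule of figure 4.4. The sketch also assumes that one iteration spans l * p cycles, so that rotation wraps around modulo l * p.

```python
def unroll_schedule(one_iteration, p, l):
    """Expand a one-iteration schedule into l statically named copies.

    one_iteration is a list of (cycle, opcode, registers) entries, where
    registers is a list of (name, subscript) pairs and 0 <= cycle < l * p.
    """
    period = l * p
    unrolled = []
    for j in range(l):  # j-th overlapped iteration
        for cycle, opcode, registers in one_iteration:
            new_cycle = (cycle + j * p) % period      # step (i): rotate by j * p
            new_regs = [(name, (sub + j) % l)         # step (ii): rename modulo l
                        for name, sub in registers]
            unrolled.append((new_cycle, opcode, new_regs))
    unrolled.sort(key=lambda entry: entry[0])
    return unrolled


# Hypothetical one-iteration schedule with p = 2 and l = 4.
one_iteration = [
    (0, "load", [("t", 0), ("x", 3)]),  # x with subscript 3: previous iteration
    (7, "add", [("t", 0)]),
]
schedule = unroll_schedule(one_iteration, p=2, l=4)
```

Each entry of the result names a fixed physical register, which is what allows the register file to remain an ordinary RAM; the price is the l-fold growth in code size.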

Compared  to  Rau's  dynamic  name  binding,  our  approach  of  loop  unrolling  to 
achieve  static  name  binding  requires  l  times  more  code  space  per  loop.  However,  we 
believe that loop unrolling is a more cost-effective solution because it reduces the
complexity of the machine. Reduced complexity allows the clock speed to be increased and
also  reduces  the  design  cost  of  the  machine. 

4.7.  Architectural  Considerations  for  Delayed  Branches 

The  architecture  of  branch  instructions  has  a  strong  impact  on  the  complexity  of 
code  optimization  techniques.  The  THUMPER  instruction  set  includes  delayed  branches 
with guard expressions as described in chapter 2. This section discusses some
architectural considerations for branch instructions.


Recall that in the description of the decision tree scheduling technique in
chapter 2, guarded jumps that branch from an exterior block of a tree to the root of
another tree are scheduled in priority order just like any other instruction. The
problem with this strategy is as follows. Suppose branches have delay k. This means that
the terminal branch on a path through the tree should be scheduled exactly k cycles
prior to the end of the path. However, before the path is completely scheduled, the
compiler cannot determine how long the path is going to be. Thus until after it has
generated the entire path schedule, the compiler cannot determine when the terminal
branch instruction should be scheduled.

One possible solution is as follows. Once the entire path schedule has been
generated, the compiler can go backward k cycles and insert the terminal branch
instruction. Unfortunately there is no guarantee that no other instruction has been scheduled,
at the required functional unit, k cycles from the end of the path. In such a case the
compiler could insert the branch instruction k - 1 cycles prior to the end of the path and
delay the remaining k - 1 instructions by one cycle. This solution may be acceptable if
the number of instructions that can be issued per cycle is very small, such as one
instruction per cycle.

However, for highly concurrent THUMPER configurations that issue many
instructions per cycle, this solution is inefficient because no other instruction can be
scheduled for the cycle devoted to the inserted branch instruction. Moreover, this
solution may introduce inefficiencies into other paths through the decision tree because the
active code block at k - 1 cycles prior to the end of one path may not be the exterior
block for that path, but instead may be an interior block shared by several paths. In
this case the extra cycle introduced to accommodate the terminal branch instruction for
one path causes delays in all other paths that share the interior block into which the
branch instruction is inserted.

Instead,  we  solve  this  problem  by  introducing  an  extra-delay  parameter  in  branch 
instructions.  The  extra-delay  parameter  specifies  the  number  of  additional  cycles  that 
should  be  added  to  the  normal  delay  of  the  branch.  The  availability  of  this  extra-delay 
parameter  greatly  simplifies  the  DTS  technique  and  eliminates  performance  degradation 
due to insertion of terminal branch instructions in highly concurrent THUMPER
configurations.  With  this  parameter,  the  DTS  technique  simply  schedules  terminal 
branch  instructions  without  considering  how  long  the  path  may  be.  Once  the  path  has 
been  scheduled,  the  appropriate  extra-delay  can  be  computed  and  written  back  into  the 
branch  instruction.  Naturally,  if  it  turns  out  that  the  terminal  branch  instruction  is 
less  than  k  cycles  from  the  end  of  the  path,  then  the  path  must  be  padded  with  the 
appropriate  number  of  no-operation  instructions. 
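The write-back step amounts to simple arithmetic once the path length is known. The sketch below assumes a particular counting convention that the text does not spell out: cycles are numbered from zero, the distance of a branch from the end of its path is the path length minus its issue cycle, and a branch with no extra delay must sit at distance exactly k. The function name is hypothetical.

```python
def resolve_terminal_branch(branch_cycle, path_length, k):
    """Return (extra_delay, padding_noops) for an already scheduled branch."""
    distance = path_length - branch_cycle
    if distance >= k:
        # The branch was issued early: stretch its delay to reach the path end.
        return distance - k, 0
    # The branch was issued too late: pad the path with no-operation cycles.
    return 0, k - distance


early = resolve_terminal_branch(branch_cycle=2, path_length=10, k=3)
late = resolve_terminal_branch(branch_cycle=9, path_length=10, k=3)
exact = resolve_terminal_branch(branch_cycle=7, path_length=10, k=3)
```

In the first call the branch sits eight cycles from the end, so five extra cycles are written back; in the second it sits one cycle from the end, so two no-operation cycles are needed.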

The  implementation  of  the  extra-delay  parameter  is  straightforward.  The  branch 
target  address  is  saved  in  a  register  along  with  the  value  of  the  extra-delay  parameter. 
The  extra-delay  is  counted  down  by  hardware,  and  the  branch  target  address  is 
transferred  into  the  program  counter  when  the  extra-delay  count  becomes  zero. 

The  extra-delay  parameter  also  simplifies  the  simple  loop  scheduling  technique 
described  in  chapter  3.  The  SLS  technique  schedules  branches  strictly  based  on  resource 
availability  and  data-dependency  constraints,  without  consideration  for  the  initiation 
interval. Therefore the loop-completion branch can be scheduled more, or fewer, than
k  cycles  from  the  end  of  the  modulo  reservation  table. 

If the loop-completion branch is scheduled more than k cycles from the end of the
MRT, the extra-delay parameter can be used to increase the branch delay. If the
loop-completion branch is scheduled fewer than k cycles from the end of the MRT, the
solution is to concatenate one or more copies of the schedule until the loop-completion
branch is no fewer than k cycles from the end of the MRT, and then use the
extra-delay parameter as appropriate. Note that multiple copies of a complete schedule such
as the one shown in figure 4.5 can be concatenated without change. Therefore, once the
SLS algorithm has found a valid schedule for a loop, the branch issues can be quickly
resolved.
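The two cases can be combined into one small calculation. This sketch covers only the arithmetic, under the assumption that each concatenated copy adds one full schedule length to the distance between the branch and the end of the table; it does not model which branch instances remain in the concatenated code. The function and parameter names are hypothetical.

```python
def resolve_loop_branch(distance, schedule_length, k):
    """Return (copies_added, extra_delay) for the loop-completion branch.

    distance is the number of cycles from the branch to the end of the
    schedule before any copies are concatenated.
    """
    if schedule_length <= 0:
        raise ValueError("schedule_length must be positive")
    copies = 0
    # Concatenate copies until the branch is no fewer than k cycles from the end.
    while distance + copies * schedule_length < k:
        copies += 1
    return copies, distance + copies * schedule_length - k


far = resolve_loop_branch(distance=5, schedule_length=8, k=3)
near = resolve_loop_branch(distance=1, schedule_length=8, k=3)
short = resolve_loop_branch(distance=1, schedule_length=2, k=6)
```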


CHAPTER  5

CONCLUSIONS 


5.1.  Summary  of  Results 

We  have  shown  that  the  performance  of  scalar  code  can  be  improved  through  the 
use  of  an  integrated  design  philosophy  in  which  the  machine  organization,  instruction 
set  architecture,  and  compiler  code  generation  techniques  are  developed  simultaneously. 
By concentrating our research efforts on the general nature of scalar code, we have
ensured that our techniques are applicable to a wide range of applications, including job
loads that are dominated by nonnumerical and symbolic computations. The results of
this research suggest that cost-effective techniques can be used to achieve significant
speedup in the context of general purpose computer systems.

Chapter 2 described the decision tree scheduling technique for handling conditional
branch intensive scalar code. The DTS technique is a very general and robust code
generation heuristic that efficiently utilizes concurrency in the form of parallelism and
pipelining to reduce the average execution time of a tree of basic blocks. A key concept
of the DTS technique is the use of guarded jump instructions to allow overlapped
execution of multiple conditional branches, thus reducing the average delay of a
conditional branch below that which can be provided by hardware. We have shown that the
DTS technique, when combined with judicious code replication, achieves significant levels
of speedup on a variety of example program modules.

For a sufficiently large decision tree and a sufficiently parallel machine, the DTS
technique with guarded jumps and stores and selective code replication produces
schedules that approach the theoretical speedup achievable on a highly parallel,
no-overhead dataflow machine. Thus the DTS technique can be viewed as a static dataflow
technique that captures many of the advantages of dataflow processing without
incurring the inevitable overhead associated with dynamic dataflow processing.

Chapter  3  focuses  on  the  problem  of  code  generation  for  recurrence-intensive  loop 
code.  With  the  exception  of  job  loads  dominated  by  numerical  computations,  the  use  of 
linked data structures is pervasive in most general job loads. The traversal of linked
data structures gives rise to numerous recurrences through memory, reducing the
effectiveness  of  vector  and  multiprocessor  architectures.  Horizontal  architectures  offer 
improved  performance  and  cost-effectiveness;  however  horizontal  architectures  require 
sophisticated  code  generation  techniques. 

The simple loop scheduling technique described in chapter 3 generates optimal
throughput schedules for innermost loops without nested conditional statements. The
SLS technique is an adaptation and extension of the theory of optimal design of
hardware pipelines. We have shown that the SLS algorithm produces optimal
throughput schedules in minimal time, i.e., the complexity of the SLS algorithm itself is
optimal.

Architectural  support  for  the  proposed  scheduling  techniques  is  the  subject  of 
chapter  4.  In  this  chapter  we  describe  a  highly  concurrent  parametric  machine  model 
that  was  used  to  develop  the  DTS  and  SLS  techniques.  We  discuss  the  rationale  behind 
the design decisions that led to the choice of machine organization and architecture.
We also discuss several related practical code generation problems including register
assignment  issues. 

In  conclusion,  this  thesis  has 

(i) pointed out some of the principal problems that must be solved in order to achieve
high-speed general-purpose computing,

(ii) proposed new code optimization techniques for solving some of these problems,
and

(iii) proposed a machine organization that supports these code optimization techniques
and can be implemented using current technology.

5.2.  Suggestions  for  Future  Research 

Although  we  have  addressed  some  of  the  key  problems  of  high-speed  general- 
purpose  computing,  solutions  to  many  more  problems  are  necessary  before  a  practical 
implementation  of  a  computer  system  employing  the  techniques  proposed  in  this  thesis 
can  be  realized.  Some  of  these  problems  are  listed  below. 

Multilevel memory hierarchies are a standard feature of modern general-purpose
computer  systems.  Throughout  our  research  we  have  ignored  the  problem  of  page-fault 
handling.  Since  our  techniques  are  targeted  at  large-scale  application  programs  that 
require  considerable  computing  and  memory  resources,  a  high-performance  solution  to 
the  page-fault  problem  is  essential  in  order  to  handle  such  applications. 

Modern programming methodologies promote the use of many small procedures.
We have ignored the problem of speeding up procedure calls in our research. The DTS
technique can be easily extended to convert procedure calls to in-line expansion of the
called procedure, at the cost of further increasing the amount of replicated code.
However, it is much more desirable if in-line expansion can be limited only to those
execution paths of a procedure that have a high probability of being taken in the context of
the specific procedure call site. The use of intelligent procedure expansion techniques is
expected to be crucial to the achievement of high performance for object-oriented
programming methodologies that rely on extensive use of numerous small procedures.


As an extension of the guarded store and jump features, some jump instructions
can actually be entirely eliminated by making subsequent guard expressions more
complex. This possibility poses an interesting new optimization problem with a variety of
tradeoffs,  including  code  space  and  rescheduling  opportunities.  Further,  rescheduling 
opportunities  arise  from  attempting  to  use  more  detailed  information  about  segment- 
by-segment  pipeline  operations  that  may  relax  dependency  constraints  on  scheduling. 

Finally, throughout this research we have concentrated exclusively on
optimization techniques that exploit static information about the behavior of programs. It is
well-known that much more precise information, and therefore superior optimization
results, can be achieved if the dynamic behavior of a program is taken into account. We
believe that static and dynamic optimization techniques are complementary, and the
best solution in a system context should involve a combination of both types of
techniques.




REFERENCES 


[1] R. M. Russell, "The CRAY-1 Computer System," Communications of the ACM,
vol. 21, pp. 63-72, Jan. 1978.

[2] D. W. Anderson, F. J. Sparacio, and R. M. Tomasulo, "The IBM System/360
Model 91: Machine Philosophy and Instruction-Handling," IBM Journal of
Research and Development, vol. 11, pp. 8-24, Jan. 1967.

[3] R. M. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic
Units," IBM Journal of Research and Development, vol. 11, pp. 25-33, Jan. 1967.

[4] J. E. Smith, "A Study of Branch Prediction Strategies," Proc. 8th Annual
International Symposium on Computer Architecture, pp. 135-148, 1981.

[5] G. Radin, "The 801 Minicomputer," Proc. Symposium on Architectural Support for
Programming Languages and Operating Systems, pp. 39-47, Mar. 1982.

[6] D. A. Patterson and C. H. Sequin, "RISC I: A Reduced Instruction Set VLSI
Computer," Proc. 8th Annual Symposium on Computer Architecture, pp. 443-458,
May 1981.

[7] J. Hennessy, N. Jouppi, F. Baskett, T. Gross, and J. Gill, "Hardware/Software
Tradeoffs for Increased Performance," Proc. Symposium on Architectural Support
for Programming Languages and Operating Systems, pp. 2-11, Mar. 1982.

[8] J. Hennessy, N. Jouppi, S. Przybylski, C. Rowen, T. Gross, F. Baskett, and J. Gill,
"MIPS: A Microprocessor Architecture," Proc. 15th Annual Workshop on
Microprogramming, vol. 13, pp. 17-22, Oct. 1982.

[9] J. A. Fisher, "Very Long Instruction Word Architecture and the ELI-512," Proc.
10th Annual International Symposium on Computer Architecture, pp. 140-150,
1983.

[10] E. M. Riseman and C. C. Foster, "The Inhibition of Potential Parallelism by
Conditional Jumps," IEEE Transactions on Computers, vol. C-21, pp. 1405-1411,
Dec. 1972.

[11] J. L. Hennessy, "VLSI Processor Architecture," IEEE Transactions on Computers,
vol. C-33, pp. 1221-1246, Dec. 1984.

[12] D. W. Clark and H. M. Levy, "Measurement and Analysis of Instruction Use in
the VAX-11/780," Proc. 9th Annual Symposium on Computer Architecture, pp.
9-17, Apr. 1982.

[13] T. R. Gross and J. L. Hennessy, "Optimizing Delayed Branches," Proc. 15th
Annual Workshop on Microprogramming, vol. 13, pp. 114-120, Oct. 1982.

[14] P. M. Kogge, The Architecture of Pipelined Computers. New York: Hemisphere
Publishing Corporation, 1981.

[15] E. W. Dijkstra, "Guarded Commands, Nondeterminacy, and Formal Derivation
of Programs," Communications of the ACM, vol. 18, pp. 453-457, Aug. 1975.

[16] D. J. Kuck, The Structure of Computers and Computations. New York: John Wiley
and Sons, 1978.


[17]  J.  E.  Thornton.  "Parallel  Operation  in  the  Control  Data  6600,"  Proc.  AFIPS 
Conference,  vol.  26,  pp.  33-40,  1964. 

[18]  E.  G.  Coffman.  Computer  and  Job-Shop  Scheduling  Theory.  New  York:  Wiley. 
1976. 

[19]  H.  Kasahara  and  S.  Narita.  "Practical  Multiprocessor  Scheduling  Algorithms  for 
Efficient  Parallel  Processing."  IEEE  Transactions  on  Computers,  vol.  C-33.  pp. 
1023-1029,  Nov.  1984. 

[20]  J.  F.  Thorlin.  "Code  Generation  for  PIE  (Parallel  Instruction  Execution)  Comput¬ 
ers."  Proc.  Spring  Joint  Computer  Conference,  pp.  641-643. 1967. 

[21] J. A. Fisher, "Trace Scheduling: A Technique for Global Microcode Compaction,"
IEEE Transactions on Computers, vol. C-30, pp. 478-490, Jul. 1981.

[22] E. W. Davis, Jr., "A Multiprocessor for Simulation Applications," Dept. of
Computer Science Rep. UIUCDCS-R-72-527, University of Illinois at
Urbana-Champaign, Urbana, IL, 1972.

[23] J. E. Smith, "Decoupled Access/Execute Computer Architectures," Proc. 9th
Annual International Symposium on Computer Architecture, pp. 113-119, Apr. 1982.

[24] Cray-1 Reference Manual. Minneapolis: Cray Research Inc., 1976.

[25] F. H. McMahon, FORTRAN CPU Performance Analysis. Livermore, CA:
Lawrence Livermore Laboratories, 1972.

[26] S. Weiss and J. E. Smith, "Instruction Issue Logic in Pipelined Supercomputers,"
IEEE Transactions on Computers, vol. C-33, pp. 1013-1022, Nov. 1984.

[27] S.-C. Chen and D. J. Kuck, "Time and Parallel Processor Bounds for Linear
Recurrence Systems," IEEE Transactions on Computers, vol. C-24, Jul. 1975.

[28] D. J. Kuck and R. A. Stokes, "The Burroughs Scientific Processor (BSP)," IEEE
Transactions on Computers, vol. C-31, pp. 363-376, May 1982.

[29] R. G. Cytron, "Compile-Time Scheduling and Optimization for Asynchronous
Machines," Dept. of Computer Science Rep. UIUCDCS-R-84-1177, University of
Illinois at Urbana-Champaign, Urbana, IL, 1984.

[30] A. E. Charlesworth, "An Approach to Scientific Array Processing: the
Architectural Design of the AP-120B/FPS-164 Family," IEEE Computer, vol. 14,
pp. 18-27, Sep. 1981.

[31] B. R. Rau, C. D. Glaeser, and R. L. Picard, "Efficient Code Generation for
Horizontal Architectures: Compiler Techniques and Architectural Support," Proc. 9th
Annual International Symposium on Computer Architecture, pp. 131-139, 1982.

[32] S. Even, Graph Algorithms. Maryland: Computer Science Press, 1979.

[33] J. H. Patel and E. S. Davidson, "Improving the Throughput of a Pipeline by
Insertion of Delays," Proc. 3rd Annual Symposium on Computer Architecture, pp.
159-164, 1976.

[34] A. V. Aho and J. D. Ullman, Principles of Compiler Design. Reading, Mass.:
Addison-Wesley, 1977.

[35] B. R. Rau, P. J. Kuekes, and C. D. Glaeser, "A Statically Scheduled VLSI
Interconnect for Parallel Processors," in VLSI Systems and Computations, Computer
Science Press, pp. 389-395, 1981.


Peter Yan-Tek Hsu was born on October 16, 1958 in Hong Kong. He left high
school after the sophomore year to attend the University of Minnesota and received a
Bachelor of Computer Science degree in 1979. He then entered the University of Illinois
and received a Master of Science in Computer Science in 1982. Upon completion of the
Doctor of Philosophy degree, he will join I.B.M. Research in Yorktown Heights, New
York.

During his undergraduate studies at the University of Minnesota, Peter Hsu was a
teaching assistant at the Department of Computer Science from 1976 to 1979 and also a
research assistant at the Department of Psychology from 1977 to 1979. He was
employed by Sperry Univac in the summer of 1979. While pursuing graduate studies
at the University of Illinois, he was a teaching assistant at the Department of Computer
Science from 1979 to 1980, and a research assistant at the Coordinated Science
Laboratory from 1980 to 1985.


